Data Center Technology revisited 2014: Open Source is everywhere

February 27, 2014

After working in software development and data center projects for years, I would like to describe the state of the art of deploying and running software in the data center.

This article follows the typical data center stack, starting at the server hardware and storage level, continuing with the hypervisor, the operating system and the databases, and ending with the application layer, the web server, and network components such as firewalls and load balancers. The graphic below shows how far the data center has been transformed by FOSS (Free and Open Source Software). Most parts are FOSS (green), some are mixed (yellow) and a few parts are proprietary (red) and will not turn green for the foreseeable future.

[Figure: the data center stack, coloured green (FOSS), yellow (mixed) or red (proprietary)]

Lifecycle of a Data Center

Most established companies follow the replacement cycle of their hardware vendors. This means that changing the host or basic infrastructure (like the network and storage structure) takes one or two generations of hardware. As a consequence, redefining the vendor strategy of a data center is possible only at the end of one hardware life cycle and the beginning of the next. This can add up to somewhere between four and eight calendar years.

To escape this lock-in, Amazon-like services have become very popular, allowing the IT department to procure and maintain servers “in the cloud”, sometimes setting them up opportunistically.

Network components like routers, switches and firewalls are now (early 2014) mostly expensive, proprietary products at the high end, sometimes with a Linux system steering logic that is implemented in programmable hardware.

The situation is different for startups or small and medium traffic sites, where a Linux router is perfectly able to handle all the traffic.

Hardware and Virtualisation, Storage

With most servers in the data center underutilized, virtualization promised to deliver more applications, meaning more business, on less iron. However, virtualization did not make the situation in the data center less complicated: it is an additional layer which has to be understood before it can be rolled out, monitored and maintained effectively.

The free Xen hypervisor was initially the market leader, but later lost its dominance to VMware. However, with QEMU/KVM smoothly integrated into the Linux kernel and projects like OpenNebula or OpenStack giving everybody easy access to resources, the situation is again moving in the direction of open. If you can manage your virtualization layer with configuration management tools like Puppet or Chef, you can now set up a cloud in a matter of days, not months.
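To make this concrete, here is a minimal sketch, using the libvirt Python bindings, of how an automation layer can talk to a KVM host and list the defined domains with their state. The connection URI and the read-only access are illustrative assumptions, not part of any particular setup.

    # Minimal sketch: inspect a KVM/QEMU host via the libvirt Python bindings.
    # Assumes the libvirt-python package is installed and the local hypervisor
    # at qemu:///system is reachable; the URI is an example value.
    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')   # read-only connection to the local hypervisor
    try:
        for dom in conn.listAllDomains(0):          # 0 = no filter flags, list every defined domain
            state = 'running' if dom.isActive() else 'shut off'
            print('%-20s %s' % (dom.name(), state))
    finally:
        conn.close()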

Storage components in a data center were notoriously proprietary. You paid a lot of money for an amount of storage that you could normally get for a few hundred bucks in a local computer shop, because performance, mirroring, backup and management (including training of the staff) easily added up to an overhead of a million bucks a year. Here we can foresee the next dramatic change. Cluster technologies like Ceph or GlusterFS look very promising and may replace the pain of today’s storage environments. It is not hard to predict that the next generation of data centers will consist of commodity hardware running free cluster technology.
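As a rough illustration of how an application can treat such a cluster as a simple object store, here is a sketch using the python-rados bindings for Ceph; the pool name, the object key and the presence of a readable ceph.conf are assumptions for the example, not a recommendation for any concrete setup.

    # Minimal sketch: store and read back one object in a Ceph pool via python-rados.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')           # I/O context for one (hypothetical) pool
        try:
            # the object is replicated across OSDs according to the pool's replication policy
            ioctx.write_full('backup-2014-02-27', b'... some payload ...')
            print(ioctx.read('backup-2014-02-27'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()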

The flexible replication strategy delivers failure tolerance and scalability in a single step. The disadvantage of this approach is the network load, which makes 10 Gbit/s the standard and often dedicates an entire network to block replication alone.

Databases

Traditional database administration is very conservative: reliability is even more important than performance. With databases like PostgreSQL and MySQL (or its fork MariaDB), you have the choice between two very mature open source products. The only mistake I have seen so far is designing free database servers the same way as traditional hosts licensed per core: you end up maintaining one big host, which can be even more expensive than the licence costs of proprietary systems. Fortunately, the free licenses allow you to set up as many servers as you like, and by automating the setup you get new databases in a few minutes. Rolling out hundreds of small DB servers is no longer a problem in a fully automated environment.
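As a sketch of what “automating the setup” can look like at the lowest level, the following hypothetical script provisions one PostgreSQL database per application using psycopg2. Host, credentials and application names are placeholders; a real environment would drive this from Puppet or Chef rather than by hand.

    # Minimal sketch: one small database per application on a PostgreSQL server.
    import psycopg2

    APPS = ['shop', 'crm', 'wiki']                   # hypothetical applications, one database each

    conn = psycopg2.connect(host='db-master.example.com', user='postgres',
                            password='secret', dbname='postgres')
    conn.autocommit = True                            # CREATE DATABASE cannot run inside a transaction
    cur = conn.cursor()
    for app in APPS:
        cur.execute('CREATE DATABASE %s' % app)       # acceptable only for trusted, generated names
    cur.close()
    conn.close()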

There are new kids on the block overthrowing the traditional SQL paradigm. The last count showed 130 NoSQL databases, focused on document management (such as MongoDB) or graph structures (like Neo4j). There are plenty of other databases for every special purpose, the most important of which are fully integrated into Java via Spring. There is even a “foreign data wrapper” for PostgreSQL that integrates MongoDB, so you can have the best of both the SQL and the NoSQL worlds.
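For readers new to the document model, here is a minimal pymongo sketch that stores a schema-free product record and queries it back. The connection string, database and collection names are made up for the example, and a reasonably recent pymongo release is assumed.

    # Minimal sketch: schema-free documents with MongoDB via pymongo.
    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    products = client['catalog']['products']          # database "catalog", collection "products"

    products.insert_one({'sku': 'A-100', 'name': 'Rack server',
                         'tags': ['hardware', '19-inch'], 'stock': 7})
    print(products.find_one({'tags': 'hardware'}))    # matches a value inside the array field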

Network Components

The core network components in a data center are special purpose hardware, optimised very thoroughly for high load. Therefore, most of these components do not come under a pure software licence. However, for managing the hardware or configuring sophisticated setups, like a software-defined network, there is a plethora of open source projects. Their functions span all kinds of aspects, from the management of single components to entire network layers in OpenStack. In the most advanced environments, the same configuration management can be used to define the network itself. In an ongoing project, Endocode defines dozens of networks on top of OpenNebula with Puppet, giving us the ability not only to react immediately to changing demands but also to guarantee that the configuration can be staged from development through testing to production, as sketched below.
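The following tool-agnostic Python sketch only illustrates that staging idea: one parameterised network description rendered per environment, analogous to what the Puppet manifests do on top of OpenNebula. Every name and address range in it is hypothetical.

    # Minimal sketch: render the same network definitions for each environment,
    # identical apart from the addressing, so dev/test/prod stay structurally equal.
    NETWORKS = {
        'frontend': {'vlan': 10, 'prefix': 24},
        'backend':  {'vlan': 20, 'prefix': 24},
    }
    ENVIRONMENTS = {'development': '10.1', 'testing': '10.2', 'production': '10.3'}

    def render(env, base):
        """Return one definition per network for the given environment."""
        return ['%s-%s vlan=%d subnet=%s.%d.0/%d' %
                (env, name, cfg['vlan'], base, cfg['vlan'], cfg['prefix'])
                for name, cfg in sorted(NETWORKS.items())]

    for env, base in sorted(ENVIRONMENTS.items()):
        for line in render(env, base):
            print(line)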

The Software Stack

The release cycles of the pure software stack on top of the operating system are much faster. In large environments, FOSS has been the de facto standard for years. Most websites run an application stack based either on the Java Virtual Machine, using Tomcat or JBoss running on Linux, or on LAMP (Linux, Apache, MySQL, PHP), with an endless number of libraries. Simply counting the classes in one Java application, I found nearly 100,000 of them spread across a thousand different libraries.
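For the curious, here is a rough sketch of that counting experiment: walk a directory of jar files and count the .class entries they contain. The library directory is a placeholder.

    # Rough sketch: count compiled classes across all jars of an application.
    import os
    import zipfile

    LIB_DIR = '/opt/myapp/lib'                        # hypothetical directory full of jars

    jars, classes = 0, 0
    for name in os.listdir(LIB_DIR):
        if not name.endswith('.jar'):
            continue
        jars += 1
        with zipfile.ZipFile(os.path.join(LIB_DIR, name)) as jar:
            classes += sum(1 for entry in jar.namelist() if entry.endswith('.class'))

    print('%d classes in %d libraries' % (classes, jars))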

Other open source products like Ruby on Rails and Nginx are less common, but nevertheless very popular with startups. Even commercial products consist mostly of open source components. We see new major versions at least once a year. If a company starts to follow an enterprise strategy, this often means replacing the startup technologies with Java.

Monitoring and Logging

Monitoring and log management are a traditional domain of FOSS. Tools like Nagios or Icinga have been widespread and common in most data centers for the last ten years. Integrating performance monitoring with Munin or Graphite can be challenging in high load environments. Your proprietary tool may have been introduced because of its ease of use, but you can feel the pain during analysis, since the license limits the amount of data to some tens of GB per month. Tools like Logstash scale well. For a more detailed analysis of this treasure trove of data, Hadoop is the tool with the most vibrant community.
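As a small example of why Graphite integrates so easily, here is a sketch of feeding a single metric into its Carbon daemon over the plaintext protocol: one "path value timestamp" line per data point on TCP port 2003. Hostname and metric path are placeholders.

    # Minimal sketch: send one data point to Graphite/Carbon via the plaintext protocol.
    import socket
    import time

    CARBON = ('graphite.example.com', 2003)           # hypothetical Carbon endpoint

    def send_metric(path, value):
        line = '%s %f %d\n' % (path, value, int(time.time()))
        sock = socket.create_connection(CARBON, timeout=5)
        try:
            sock.sendall(line.encode('ascii'))
        finally:
            sock.close()

    send_metric('datacenter.web01.load', 0.42)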

Deployment Pipeline

The deployment pipeline is the conveyor belt from development to production. In most companies, it is realised by a Jenkins server and a number of small scripts pushing the latest version to the front page. With more applications, the number of servers and build systems grows fast. Without the flexibility and scalability of open source, it would be impossible to publish the next version of a web application within hours or even minutes.
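To illustrate one link in such a pipeline, here is a hypothetical snippet that triggers a parameterised Jenkins job over its remote API once a new revision is ready. URL, job name, credentials and parameters are placeholders, and depending on the Jenkins configuration a CSRF crumb may be required as well.

    # Minimal sketch: queue a parameterised Jenkins build over the remote API.
    import requests

    JENKINS = 'https://jenkins.example.com'           # hypothetical Jenkins instance
    AUTH = ('deploy-bot', 'api-token')                # user and API token, placeholders

    resp = requests.post('%s/job/deploy-webapp/buildWithParameters' % JENKINS,
                         auth=AUTH,
                         params={'GIT_REVISION': 'abc1234', 'TARGET': 'production'})
    resp.raise_for_status()                           # Jenkins answers 201 and queues the build
    print('build queued:', resp.headers.get('Location'))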

Conclusion

With the exception of high end networking hardware, open source is everywhere. It now seems virtually impossible to run a data center without a majority of components implemented in and connected by open source. This post can only glance at some of the market leaders; there are many more around, and selecting the right component for your environment can be a challenging task. However, compared to selecting proprietary components, having to choose among several free and open source ones looks like a breeze.

About the author

Thomas Fricke is a partner at Endocode AG and adds more than 25 years of experience in software development and data center management to our track record. With his keen understanding of complex systems, he designs environments to build fast, maintainable and reliable automated delivery chains and highly available distributed database environments.

FOSS Products mentioned in this post:

Virtualisation

KVM, OpenStack, OpenNebula, Xen

Configuration Management

Chef, Puppet

Storage

Ceph, GlusterFS

Network

A good overview of Open Source for Software Defined Networks from SDN Central

SQL Databases

PostgreSQL, MySQL, MariaDB

NoSQL

MongoDB, PostgreSQL MongoDB Wrapper, Neo4j; a very good overview of NoSQL databases at nosql-database.org

Application Stack

Java: OpenJDK, Tomcat, JBoss

LAMP: Debian, Ubuntu, Red Hat, CentOS, Scientific Linux, Apache, PHP

Deployment: Jenkins

Other: Python, Ruby, Rails, Nginx

Monitoring and Logging

Nagios, Icinga, Graphite, Check_MK, Logstash, Hadoop