The Documaster Technology Stack
I recently realized that we have spent five years at Documaster researching and making technology work for so many customers, and yet, very few people outside the company know much about our technology stack.
In this post, I will describe our stack and point out why we believe we made the right choices when it comes to technology – choices that not only make Documaster a superb product in its own right but also a product that integrates nicely with various third-party applications.
A technology stack consists of the programming languages and software components (including their layering, interactions, etc.) used to develop and run a software system. When discussing products such as Documaster, which is hosted on a server and accessed by end users from their computers or mobile devices, a technology stack should be split into two – an application stack and an infrastructure stack. The application stack focuses on the programming languages and libraries used for development as well as the runtime components required by the software. The infrastructure stack deals with how the software is installed, upgraded, monitored, secured, and how backups are managed.
Deciding on a technology stack is never easy, especially when developing complex software. It is worth noting that having a team with a broad range of programming and infrastructure experience certainly expands the possible technology choices. At Documaster, we always follow a few guiding principles when selecting technology: 1) we must be free to use any technology that is a good fit for the problem we are trying to solve; 2) we must have full control over supporting the non-functional requirements of our software (performance, scalability, reliability, etc.); 3) we stick to widely adopted technology with rich and mature ecosystems; and 4) we also adopt cutting-edge, production-ready technology when our customers and products benefit from it.
Documaster runs on Linux. This should not surprise anyone considering the fact that Linux and other Unix-like operating systems are used by the majority of servers on the Internet, and Linux is quite a popular choice for intranet servers too. However, popularity is not the only reason why we chose Linux over Windows Server and other alternatives. It was primarily because of the vast collection of rock-solid open-source libraries, tools, and other software components Linux has to offer. We picked the Ubuntu Server LTS Linux distribution, which comes with five years of security updates and generally includes newer versions of tools and libraries than other popular server distributions.
Our databases of choice are MongoDB for storing non-relational data and PostgreSQL for storing relational data. MongoDB is probably the most used NoSQL database and undoubtedly one of the most mature. Being a relational database, PostgreSQL has a much longer history during which it has had to compete against many serious rivals. Despite this, it has managed to become the fourth most popular relational database in the world. PostgreSQL will likely not be in the top three most widely used relational databases (1st – Oracle, 2nd – MySQL, 3rd – SQL Server) anytime soon, but it definitely offers a great mix of performance, reliability and cost compared with its main competitors. Oracle was never a viable option for us due to its licensing model, which is incompatible with what we do. We considered using MySQL until PostgreSQL outperformed it heavily during the tests we conducted. SQL Server simply did not run on Linux until SQL Server 2017 and thus was not an option for us. However, we will keep a close eye on its future development. Note that we use object-relational mapping frameworks but have decided to stick to specific databases to be able to optimize our software better, while ensuring that Documaster is not too tightly coupled with any particular database.
There are two main full-text search options these days – Solr and Elasticsearch. Both are based on the Lucene library, scale nicely, and do their job pretty well. We chose Solr because it is the more mature of the two technologies.
Documaster can be configured to scan all incoming documents for viruses using ClamAV, but it is possible to use other antivirus software too.
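To give a rough idea of what handing a document to ClamAV can look like, the sketch below frames a payload for the INSTREAM command of the clamd daemon and interprets the reply. The function names and the chunk size are hypothetical, and talking to clamd over a socket is an assumption made for illustration; this post does not describe how Documaster actually integrates with ClamAV.

```python
import struct


def frame_instream(payload: bytes, chunk_size: int = 8192) -> bytes:
    """Build the bytes sent to clamd for an INSTREAM scan (illustrative helper).

    The command is followed by chunks, each prefixed with its length as a
    4-byte big-endian integer, and terminated by a zero-length chunk.
    """
    parts = [b"zINSTREAM\0"]
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        parts.append(struct.pack("!I", len(chunk)) + chunk)
    parts.append(struct.pack("!I", 0))  # zero-length chunk ends the stream
    return b"".join(parts)


def parse_reply(reply: bytes):
    """Return (is_clean, signature_name) from a clamd reply (illustrative helper)."""
    text = reply.rstrip(b"\0\n").decode("utf-8", "replace")
    if text.endswith("FOUND"):
        # Typical shape: "stream: <signature name> FOUND"
        body = text.split(": ", 1)[1]
        return (False, body[:-len(" FOUND")])
    if text.endswith("OK"):
        return (True, None)
    raise ValueError("unexpected clamd reply: " + text)
```

In a real integration, the framed bytes would be written to the clamd TCP or Unix socket and the reply read back before the document is accepted or rejected.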
The backend components of Documaster are responsible for document processing, data flow orchestration, and records management. All are written in Java 8 and run on OpenJDK. We picked Java because of its amazing and thriving ecosystem of libraries, tools, and frameworks. Probably for the same reasons, Java is one of the most widely used programming languages in the world. The choice of technology for the frontend was quite different. The GUI of Documaster is written in React, and we use PHP to develop a simple middleware with support functions for the GUI.
Documaster uses the OpenID Connect and OAuth 2 standards for authentication and authorization (as opposed to the much older SAML 2, whose popularity is noticeably declining) and supports a wide variety of user directories and identity providers. Some of our customers use Documaster IDP (an identity provider developed by us) with a built-in user directory. Others use Documaster IDP with their organizational Active Directory acting as a user directory. Customers who rely on Office 365 for their daily tasks can use Azure AD as an identity provider and user directory, essentially getting single sign-on between Documaster and Office 365.
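For readers unfamiliar with OpenID Connect, the sketch below shows the first step of the standard authorization code flow: building the URL that sends the user's browser to the identity provider. The endpoint, client ID, and redirect URI are placeholders, and the function name is hypothetical; none of them are taken from Documaster IDP.

```python
import secrets
from urllib.parse import urlencode


def build_authorization_url(authorize_endpoint, client_id, redirect_uri,
                            scope=("openid", "profile")):
    """Return (url, state, nonce) for an OpenID Connect authorization request.

    Hypothetical helper. The caller stores state and nonce in the user's
    session and compares them with the values that come back later.
    """
    state = secrets.token_urlsafe(16)  # guards the redirect against CSRF
    nonce = secrets.token_urlsafe(16)  # binds the ID token to this request
    query = urlencode({
        "response_type": "code",       # authorization code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scope),      # must include "openid" for OIDC
        "state": state,
        "nonce": nonce,
    })
    return authorize_endpoint + "?" + query, state, nonce
```

After the user signs in, the identity provider redirects back to the redirect URI with a code that the application exchanges for tokens at the provider's token endpoint.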
While acting as a central repository for the most important documents in an organization, Documaster allows its end users to reuse the knowledge kept in the repository. The more documents in the repository, the easier this is, but many valuable documents are created and exist outside of Documaster in different desktop and cloud business systems and office productivity suites. We have developed several software components to help our customers capture these documents: a data flow orchestration component, which can retrieve and archive documents from live business systems; an Open API (along with client libraries and sample code on GitHub), which enables other systems to send data to and receive data from Documaster; a Microsoft Office add-in, which helps users archive documents from Microsoft Office 2013 and 2016; and of course our brand new Office 365 add-in, which will allow users to do their daily tasks (document management, approvals, etc.) in Office 365 and have the most valuable documents and metadata archived in Documaster either manually or automatically.
Documaster can be installed almost anywhere. A small portion of our customers prefer to run the software in VMware or Hyper-V virtual machines hosted in their private clouds. In these cases, we take care of installing and upgrading Documaster so that the customer’s IT department can focus solely on monitoring, managing backups, and other infrastructure-related tasks. Documaster can also run in any public cloud, such as Azure, Amazon, and DigitalOcean. However, the best environment for Documaster is undeniably our very own cloud, where we take care of everything. With this in mind, it is not surprising that most of our customers choose this option. Documaster Cloud is not only optimized to run our software but also offers a range of features and services that are not available in many public and private cloud offerings.
Documaster Cloud runs on physical hosts situated in several standards-compliant, well-secured and modern data centers in Norway and Sweden. The physical data storage used by the hosts consists of large arrays of SSDs, NVMe drives, and/or HDDs. All disks that store sensitive data are encrypted with LUKS (software-based encryption on Linux) to make it impossible for data to be accessed if the disks get stolen (the probability of which is tiny bearing in mind the physical security present at the data centers).
Each physical host runs Linux with two key components installed – LXD and ZFS. LXD is a container management framework that aims to host a complete Linux operating system (as opposed to Docker, which focuses on application containment). While container technology has an almost forty-year history, it only gained significant traction in the last few years, reaching a point where containers are now a great lightweight alternative to virtual machines in many cases. LXD containers are a lot less resource-intensive than virtual machines because they share the host’s Linux kernel and do not need a hypervisor to do hardware emulation. LXD allows resources such as CPU, RAM, disk space and disk IO bandwidth to be dynamically allocated. The greatest thing about LXD, however, is the fact that it supports ZFS natively.
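As a small illustration of dynamic resource allocation, the sketch below builds the lxc client commands that would set CPU and memory limits on a container. The limits.cpu and limits.memory keys are standard LXD configuration keys; the helper name and the container name and values in the usage note are made up, and the sketch says nothing about how limits are actually managed in Documaster Cloud.

```python
def lxd_limit_commands(container, cpus, memory):
    """Return lxc argument lists that cap a container's CPU count and RAM.

    Hypothetical helper: it only builds the commands. Passing each list to
    subprocess.run on an LXD host would apply the limit.
    """
    return [
        ["lxc", "config", "set", container, "limits.cpu", str(cpus)],
        ["lxc", "config", "set", container, "limits.memory", memory],
    ]
```

For example, lxd_limit_commands("customer-a", 4, "8GiB") yields two commands that restrict the hypothetical container customer-a to four CPUs and 8 GiB of RAM.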
ZFS is a technology that finally unified in a single package the three key aspects of data storage – file system, disk redundancy, and logical volume management. It was introduced to Linux from OpenSolaris in 2006 and got its first stable Linux release seven years later. ZFS supports file systems of up to (theoretically) 256 trillion yobibytes, works with different types of disks (HDD, SSD, and NVMe) and conducts continuous integrity checks and automatic repairs. It supports several different redundant setups, which allow us to optimize the disk arrays for integrity as well as for read/write speed based on the scenario at hand. ZFS uses the copy-on-write transactional model, thus enabling the extremely fast creation of (consistent) snapshots and cloning of file systems. From an LXD perspective, this means that ZFS makes it possible to create a snapshot of each container in our cloud every few minutes, and quickly copy the snapshot to a backup server in another data center, where we end up with an identical copy of the original container. Easily moving containers between hosts is also a major strength of LXD.
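The snapshot-and-copy cycle described above can be sketched with three standard ZFS commands: zfs snapshot, zfs send, and zfs receive. The helper names, the snapshot naming scheme, and the use of ssh to reach the backup server are assumptions made for illustration, not a description of the actual Documaster Cloud tooling.

```python
from datetime import datetime, timezone


def snapshot_name(dataset, when):
    """Name a snapshot after its UTC creation time (hypothetical scheme)."""
    return dataset + "@auto-" + when.strftime("%Y%m%dT%H%M%SZ")


def replication_commands(dataset, previous_snapshot, when,
                         backup_host, backup_dataset):
    """Build the commands for one snapshot-and-replicate cycle.

    The output of the "send" command is meant to be piped into the
    "receive" command. With a previous snapshot, only the blocks changed
    since then are sent (an incremental stream).
    """
    new_snapshot = snapshot_name(dataset, when)
    send = ["zfs", "send"]
    if previous_snapshot:
        send += ["-i", previous_snapshot]  # incremental since last snapshot
    send.append(new_snapshot)
    return {
        "snapshot": ["zfs", "snapshot", new_snapshot],
        "send": send,
        "receive": ["ssh", backup_host, "zfs", "receive", "-F", backup_dataset],
    }
```

Because snapshots are copy-on-write, taking one is nearly instantaneous, and an incremental send transfers only what changed, which is what makes a cycle of every few minutes practical.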
We have always wanted to create and maintain Documaster installations in our cloud in exactly the same way, regardless of customer size and other specifics. To achieve this, we use Ansible for server provisioning and automating the software deployment process. Ansible also allows us to easily keep all customers in our cloud on the latest version of Documaster and be sure of exactly how each container and physical host is configured. We use Icinga 2 to monitor various aspects of the physical hosts, network, containers, and applications, and also to notify us in case of issues. As previously mentioned, backups are managed (mostly) using ZFS snapshots.
Documaster is hosted on shared hosts for small and medium-sized customers. For large customers, we typically use one or more dedicated hosts, including dedicated backup locations in another data center. We also offer several extra options upon request, the most popular of which are a dedicated private subnet and a site-to-site VPN tunnel. When combined, these two create an environment that is completely sealed off from all other environments in our cloud and from the outside world, and that is accessible only from the customer’s premises.
In this post, I briefly described our technology stack and explained what guidelines we follow when selecting technology. We spent five years researching and trying out different technologies, and making many choices along the way. Some of these were pretty standard and easy, while others were far from mainstream and therefore more difficult. We did this to develop a modern product that is powerful and easy to use, and which integrates nicely with some of the most popular business systems and office productivity suites on the market. Almost from day one, we also envisioned creating our own cloud, where we can host Documaster and offer it as a service to our potential customers to lower their software costs. Behind all this stands the amazing team at Documaster. Rest assured that not a single day passes without us thinking about how to improve our technology stack and therefore the product and its underlying cloud infrastructure.