Documaster’s Commitment to AI Readiness

08-Nov-2019 12:56:00

We believe that the business value of AI does not lie in its ability to amaze you at the beginning. Its real value comes when it can help you work smarter by automating repeatable but time-consuming tasks and freeing your hands to focus on what really matters for your business, without standing in your way. We’ve looked at ways to leverage client data to achieve those purposes in the context of our products.

Exponential advances in computing technology over the past 10-15 years and the omnipresence of networked devices have turned artificial intelligence and machine learning into household names. While less than two decades ago society could only theorize about the potential good or harm inherent in a “thinking” machine, we now seem to have accepted that artificial intelligence is the de facto state of any global Internet service, be it a social network, a media provider or an e-commerce giant. We either suspect that such services learn from our data (or find out from a breaking-news report that this has been the case), or are told explicitly that turning on data collection will make the service better. A growing number of multinationals have embraced the stance that “data is the new oil” and act on the assumption that commanding vast quantities of real-world data gives a real competitive advantage.

While generally true, each of the foregoing statements is based on an oversimplified view of the current state of technology and the nature of learning from data. Because third-party data is the reason Documaster exists, we have also been working on a data-driven development strategy for our products since the start. And we have come a long way from assuming that training a neural network on a client’s data would bear immediate fruit, making us the next Silicon-Valley AI darling. With a growing number of clients relying on our services for fast, intuitive and secure handling of their most valuable documents, this strategy has served us and our clients well. We believe that going forward our clients’ success will continue to be largely dependent on our approach to developing data-driven product features and would therefore like to outline it in this article.

The Promise of AI

In the race to win market share it is tempting to overpromise on the effects of leveraging data in one’s product or service. In a client-consultant relationship this would typically mean that a new “smart” feature and a road map are agreed upon, with the actual implementation sometime in the not-so-near future, after all the consultancy hours have been billed. As a SaaS product company, Documaster never had overpromising as an option, since what is delivered in the standard case is the existing product. Client-driven change requests are limited in scope and must be aligned with the needs of the broadest possible client base, where existing clients take top priority. There is very little room for overstating the capabilities of an existing product, and overpromising on a fantastic upcoming feature would put at risk not just one but the majority of our existing client relationships.

We have grown to realize, however, that being limited in what we can promise does not free us from the responsibility to ensure that our product will be ready, at any time in the future, to leverage the actual data it stores and manages to drive functionality and user experiences. We are also perfectly aware that harnessing the power of AI takes more than applying the latest technology or data-science skills. In fact, there seems to be consensus across the data-science community that collecting and cleaning the correct training data for supervised learning matters just as much as the choice of machine-learning algorithm. It could hardly be otherwise, since the input to any supervised machine-learning algorithm is the ground truth forming the foundation of the abstract world in which the machine operates.

Data Quality

To ensure that our product and its clients are AI-ready, we’ve developed a dedicated workflow that allows the client to determine exactly which documents are important to their organization and should therefore make it to long-term storage, weeding out the duplicates at the source. Documaster’s Assistant applies well-understood information-retrieval techniques to assign a relevance score to every candidate record before it is committed to the Documaster database. With this check in place, any part of the database can serve as the ground truth for training a model, as it will contain validated data relevant to the specific client. We’ve mainly operated in a few verticals (such as shipping, telecommunications and legal documentation), but even within the same vertical, client needs differ widely. With minimal configuration effort, however, the client can customize the discriminating engine so their Documaster instance filters only what is important to them.
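
As a minimal sketch of the kind of information-retrieval scoring described above, the snippet below computes TF-IDF weights over a small corpus and scores a candidate record against a client “interest profile” by cosine similarity. The function names and the profile idea are illustrative assumptions, not Documaster’s actual engine.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute TF-IDF weight vectors for a small corpus of tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency scaled by inverse document frequency
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def relevance(candidate_vec, profile_vec):
    """Cosine similarity between a candidate record and a client interest profile."""
    dot = sum(candidate_vec.get(t, 0.0) * w for t, w in profile_vec.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values())) or 1.0
    return dot / (norm(candidate_vec) * norm(profile_vec))
```

Records scoring below a client-chosen threshold would be flagged rather than silently archived, keeping the stored corpus clean.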

In a typical machine-learning experiment, real-world data is first collected over a long period of time, then sent for cleaning and preparation to a data-science team that is often removed from the business process generating the data. Since the business process does not aim to provide valid training data tailored to the needs of the experiment, such data is bound to present anomalies, which then have to be investigated and eliminated outside of the business process itself, long after the data was generated. Documaster’s data model has evolved around the strict requirements of the NOARK5 standard. Because of the need to provide a standards-compliant service, we have developed a safety harness around the client’s data that prevents many of the conditions one has to guard against when preparing training data. By design, Documaster will not allow anomalies into the database in the first place, so anomalous, out-of-place records and metadata, circular references, or inadvertent duplication are resolved already at ingestion time.
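
Ingestion-time checks of this kind could be sketched as follows. The record shape, the `IngestionError` type, and the two checks shown (content-hash duplicates and circular parent references) are simplified assumptions for illustration, not Documaster’s actual validation layer.

```python
import hashlib

class IngestionError(ValueError):
    """Raised when a candidate record would introduce an anomaly."""

def validate_record(record, index, parent_of):
    """Reject duplicates and circular references before a record is stored.

    `record` is a dict with 'id', 'parent' and 'content'; `index` maps content
    hashes to ids already stored; `parent_of` maps record id -> parent id.
    """
    digest = hashlib.sha256(record["content"].encode()).hexdigest()
    if digest in index:
        raise IngestionError(f"duplicate of record {index[digest]}")
    # Walk the parent chain; revisiting a node means a circular reference.
    seen, node = {record["id"]}, record.get("parent")
    while node is not None:
        if node in seen:
            raise IngestionError("circular reference detected")
        seen.add(node)
        node = parent_of.get(node)
    index[digest] = record["id"]
    parent_of[record["id"]] = record.get("parent")
    return digest
```

Because the check runs at ingestion, the anomaly is rejected while the business context is still at hand, instead of being discovered months later by a data-science team.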

Data Availability and Consistency

Once collected, the ground truth is to be made available for training, validation and testing. Developing data-driven functionality is an iterative process of data preparation, feature extraction, feature engineering, model training, validation and testing, where the best model trained on the most effective features extracted from or computed over the ground truth will make it to production. The modern development effort is typically distributed across vast geographies and multiple time zones, and it is common for several teams to be working on the same problem in a large data-driven project simultaneously. It then becomes crucial for each engineering team to have the assurance that the data it is experimenting with is the same as the data held by any other team contributing to the same effort. To put it simply, in order to be successful, engineering teams must compare apples to apples. Interestingly enough, this simple requirement turns out to be one of the hardest to fulfill.
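
One common way to guarantee that distributed teams really are comparing apples to apples is to fingerprint the ground-truth snapshot: if two teams’ fingerprints match, they hold identical data. Below is a minimal, order-independent sketch of our own devising, not a Documaster API.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Order-independent SHA-256 fingerprint of a ground-truth snapshot.

    Each record is hashed from its canonical JSON form; the sorted record
    hashes are then hashed together, so row order does not affect the result.
    """
    record_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(record_hashes).encode()).hexdigest()
```

Teams exchange the short fingerprint instead of the data itself; any silent divergence in a copy of the ground truth shows up immediately as a mismatch.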

There is a general trend towards offloading CPU- and memory-intensive tasks such as training a model to dedicated cloud infrastructures, public or private. In order for the training job to complete in reasonable time, data pipeline components are highly parallelized and communication takes place across the network. In such settings it becomes ever more difficult to ensure that a single source of truth is being used unless the ground truth itself is exposed in the form of a network service. The alternative is exchanging copies of the ground truth and implementing complex checks and validations to ensure consistency across all parallel nodes in the model-training pipeline. Documaster is delivered on the cloud as standard, exposing clean, well-defined APIs serving data and metadata. As such, it is ready to serve as the single source of truth regardless of the complexity of the model-training job.
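
A service exposing the ground truth over a paginated API can then be consumed as the single source of truth by every node in the training pipeline. In this hedged sketch, `fetch_page` stands in for a hypothetical network call to such an endpoint; the real Documaster APIs are not shown.

```python
def fetch_ground_truth(fetch_page, page_size=100):
    """Stream all records from a paginated ground-truth service.

    `fetch_page(offset, limit)` is a stand-in for a network call returning a
    list of records, empty when the dataset is exhausted. Every training node
    pulling through this generator reads the same authoritative data.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)
```

Because nodes stream from one endpoint rather than passing copies around, the consistency checks described above become unnecessary by construction.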

Leveraging AI Beyond the Proof-of-Concept Stage

It should be made clear that no one-size-fits-all AI solution exists. Any solution claiming to make AI-informed decisions is in fact trained and tuned to perform one or several highly specialized tasks that only make sense in a specific context. Even with cutting-edge transfer-learning techniques, one cannot apply a model trained to recognize human faces straight away to the task of separating cats from dogs and expect meaningful outcomes. Like any automation development task, configuring and training an ML model takes a long time, a lot of focused effort and, above all, access to sufficient client data of high quality. In the absence of all the necessary resources, vendors fall back on highly curated data sets when demonstrating “smart” capabilities, delivering predictable and impressive performance at the pre-sales stage. Adapting the successful proof-of-concept to the client’s use case afterwards is a painful process that more often than not delivers discouraging results.

Zero-Effort Smart Capabilities

Our latest generation of document management tools comes with the ability to classify your documents along a virtually unlimited number of dimensions, and to define multifaceted searches across those in an intuitive way, owing to an exceptional, consumer-grade UX. We’ve realized, however, that this alone is not enough to motivate everyone across the organization to file their important documents correctly, applying the necessary classifications.

To mitigate that, we’ve introduced a non-binding suggestion service that pre-populates the most likely classifications for any new content created in the system. It is based on a highly configurable, modular architecture that gives us great flexibility in the choice of suggestion mechanisms depending on the maturity of the specific Documaster database. The client benefits from this feature right from the start, without needing to allocate resources to train a model first. As the system matures and sufficient data has been gathered from actual production usage, a model kicks in to complement the suggestions. It goes without saying that the more you use it, the more accurate its suggestions, the fewer corrections you’ll have to make to the suggested classifications, and the better Documaster will serve your purposes in the long run.
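
A pluggable suggestion chain of the kind described here might look like the sketch below, where a simple rule-based suggester works from day one and a trained model only takes over once enough ground truth has accumulated. All names and thresholds are hypothetical; Documaster’s actual architecture is not shown.

```python
class SuggestionService:
    """Tries each suggester in order; a suggester declines by returning None."""

    def __init__(self, suggesters):
        self.suggesters = suggesters

    def suggest(self, document):
        for suggester in self.suggesters:
            result = suggester(document)
            if result is not None:
                return result
        return []  # non-binding: no suggestion is a valid outcome

def rule_based(document):
    """Keyword heuristic available with zero training data."""
    if "invoice" in document.lower():
        return ["Finance/Invoices"]
    return None

def model_based_factory(min_examples, training_set, model):
    """Wrap a trained model so it only answers once enough data exists."""
    def suggester(document):
        if len(training_set) < min_examples:
            return None  # defer to simpler suggesters further down the chain
        return model(document)
    return suggester
```

Ordering the model ahead of the rules means suggestions improve automatically as production data accumulates, with no change visible to the user.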

Are you AI-ready?

We know what it takes to change an existing process in the enterprise. And we won’t try to force you to make a change today. Instead, we give you a robust, scalable archival and case-management solution that integrates seamlessly into your existing process. One you can put to productive use immediately. And one that will be ready to deliver smart capabilities when you and your organization are.
