Data Linking for a Noisy, Huge Dataset
Urvin was approached by a firm that specializes in organizing huge amounts of data for clients around the world. The data is primarily bibliographic in nature: information about books, periodicals, and authors. These are small pieces of data, with little description beyond names and titles. Not only is the data noisy, but the dataset is huge. The company had embarked on a project to link all of these pieces together: authors with their name variations and aliases; titles of their works, including titles in multiple languages; substring matches of those titles inside sentences about the works; and links from authors to their genres, co-authors, dates, media, and languages. The company could only consider solutions that could process petabytes of data with a reasonable hardware footprint and processing time. It had exhausted its open source options and concluded that no “state of the art” open source offering could accommodate its requirements.
Urvin’s AI team took a look at the dataset and quickly understood why current open source solutions were struggling with the business problem. The first challenge was that the records were small fragments of text, often noisy enough to defeat exact string matching. The second was the ambiguity of the data: a single name could refer to tens or hundreds of potential authors. The third was managing derivative works, such as translations and commentaries. And the final challenge was handling all of these problems at petabyte scale. The client had seen several demonstrations of generic NLP platforms, but had never been shown one that could operate on their difficult data at scale. Several solutions looked good on laboratory-sized, clean data, but failed spectacularly when applied to extremely large, noisy datasets.
Urvin’s unique natural language engine has several features that make it ideal for this kind of problem:
- Extremely high throughput for string matching, orders of magnitude better than current “state of the art.” These algorithms:
- scale linearly with the length of the input text
- match in nearly constant time regardless of vocabulary size
- are computationally very efficient (low constant factors)
- have a low memory footprint (far less than would be needed to store the vocabulary itself)
- The ability to build custom semantic graphs from client data, which lets our novel disambiguation algorithms drive named entity recognition and entity linking at scale.
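The throughput properties above (a single linear pass over the input, with per-character work independent of vocabulary size) are characteristic of automaton-based multi-pattern matching. As an illustration only, and not Urvin’s proprietary implementation, a minimal Aho-Corasick sketch in Python shows those two scaling behaviors:

```python
from collections import deque

def build_automaton(vocabulary):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    trie = [{}]   # node -> {char: child node}
    fail = [0]    # node -> failure-link target
    out = [[]]    # node -> patterns that end at this node
    for word in vocabulary:
        node = 0
        for ch in word:
            if ch not in trie[node]:
                trie.append({})
                fail.append(0)
                out.append([])
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        out[node].append(word)
    # Breadth-first pass wires up failure links, shallowest nodes first.
    queue = deque(trie[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[child] = trie[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return trie, fail, out

def find_all(text, automaton):
    """One pass over text: linear in len(text), independent of vocabulary size."""
    trie, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]:
            node = fail[node]
        node = trie[node].get(ch, 0)
        for word in out[node]:
            hits.append((i - len(word) + 1, word))  # (start offset, pattern)
    return hits

auto = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", auto))  # → [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The key property is that adding more vocabulary grows the automaton but not the per-character work during a scan. A production system at petabyte scale would of course need a far more compact automaton encoding than this dictionary-of-dicts trie.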
Urvin’s software development team built custom connectors to multiple data sources that ingest, extract, and transform the data into a common intermediate format, from which our proprietary data structures are built. We were then able to leverage our novel data structures and algorithms to meet the client’s performance and accuracy requirements for both string matching and entity disambiguation/linking:
- Our string matching performance was just over 500 nanoseconds per string, resulting in slightly less than 2 million matches per second.
- Combining string matching and entity disambiguation/linking took around 2 microseconds per match-and-link, resulting in slightly less than 500 thousand match-and-links per second, with 91% accuracy.
Both benchmarks were achieved on a small hardware footprint that could easily be scaled horizontally.
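To make the disambiguation step concrete: the core idea of scoring candidates against a semantic graph can be sketched in a few lines. Every identifier, author, and context entity below is invented for illustration, and the scoring here is a simple Jaccard overlap rather than Urvin’s proprietary algorithms:

```python
# Toy candidate graph: each candidate author ID maps to the entities it is
# linked to (co-authors, genres, languages). All names are hypothetical.
CANDIDATE_GRAPH = {
    "author:j_smith_1": {"coauthor:a_jones", "genre:history", "lang:en"},
    "author:j_smith_2": {"coauthor:m_chen", "genre:poetry", "lang:fr"},
}

def disambiguate(mention, candidates, context_entities, graph):
    """Pick the candidate whose graph neighborhood best overlaps the
    entities found near the mention (Jaccard similarity)."""
    def score(cand):
        neighbors = graph.get(cand, set())
        if not neighbors or not context_entities:
            return 0.0
        return len(neighbors & context_entities) / len(neighbors | context_entities)
    return max(candidates, key=score)

best = disambiguate(
    "J. Smith",                              # ambiguous surface form
    ["author:j_smith_1", "author:j_smith_2"],  # candidates from string matching
    {"coauthor:m_chen", "lang:fr"},          # entities seen near the mention
    CANDIDATE_GRAPH,
)
print(best)  # → author:j_smith_2
```

A string match surfaces the candidates; the context around the mention then selects among them. At production scale the graph, the candidate generation, and the scoring are all far more sophisticated, but the match-then-link pipeline follows this shape.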
There’s no better way to summarize how well our solution worked than to quote the client: “What I’ve seen is impressive. This is the first actual demonstration of viability that I’ve seen from any company that can take our data and work at our scale. This is remarkable speed and remarkable quality.” Urvin’s unique natural language technology, combined with our in-depth knowledge of building AI at scale, provided the solution this client needed: the only viable one they had seen, commercial or open source.