Data Linking for a Noisy, Huge Dataset

Urvin Data Linking Use Case

Business Challenge

Urvin was approached by a firm that specializes in organizing huge amounts of data for clients all around the world. The data is primarily bibliographic in nature, covering books, periodicals and authors: small pieces of data with no substantive descriptions beyond names and titles. Not only is the data noisy, but the dataset is huge. The company had embarked on a project to link all of these pieces together: authors with their name variations and aliases; titles of their works, including titles in multiple languages; substrings of titles appearing in sentences about those works; and authors linked to their genres, co-authors, dates, media and languages. The company could only consider solutions that could process petabytes of data with a reasonable hardware footprint and processing time. They had exhausted their open source options and concluded that no “state of the art” open source offering could accommodate their requirements.

Urvin’s Approach

Urvin’s AI team examined the dataset and quickly understood why current open source solutions were struggling with the business problem. The first challenge was the small pieces of data, which often contained noise that defeated exact string matching. The second was the ambiguity of the data: a single name could refer to tens or hundreds of potential authors. The third was managing derivative works, such as translations or commentaries. And the final challenge was handling all of these problems at petabyte scale. The client had seen several demonstrations of generic NLP platforms, but had never been shown one that could operate on their difficult data, at scale. Several solutions looked good on laboratory-sized, clean data, but failed spectacularly when applied to extremely large, noisy datasets.

Urvin’s unique natural language engine has several features that make it ideal for this kind of problem:

  • Extremely high throughput for string matching, orders of magnitude better than current “state of the art.” These algorithms:
    • scale linearly with the length of the input text
    • match in nearly constant time regardless of vocabulary size
    • are computationally very efficient (a low constant factor per character)
    • have a low memory footprint (much less than needed to store vocabulary)
  • The ability to build custom, semantic graphs using client data, which allows us to take advantage of our novel disambiguation algorithms to facilitate named entity recognition and entity linking, at scale.
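Urvin’s matching engine is proprietary and not described here, but the first three properties listed above (time linear in input length, nearly independent of vocabulary size, low constant factor) are also exhibited by the classic Aho-Corasick multi-pattern automaton; the memory claim goes further than this textbook construction, which stores the full vocabulary in a trie. A purely illustrative sketch, not Urvin’s actual algorithm:

```python
from collections import deque

class AhoCorasick:
    """Multi-pattern matcher: a trie of patterns plus failure links.

    Matching time is O(len(text) + number_of_matches), independent of
    how many patterns are in the vocabulary -- one pass over the text.
    """

    def __init__(self, patterns):
        self.goto = [{}]    # per-state transitions: char -> next state
        self.out = [[]]     # per-state list of patterns ending there
        self.fail = [0]     # per-state failure link
        for pat in patterns:
            self._insert(pat)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                self.goto.append({})
                self.out.append([])
                self.fail.append(0)
                nxt = len(self.goto) - 1
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first over the trie, so a node's failure target
        # (always shallower) is finalized before the node itself.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[nxt] += self.out[self.fail[nxt]]

    def find_all(self, text):
        """Yield (end_index, pattern) for every occurrence in text."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                yield i, pat

# Toy bibliographic example: overlapping author/title vocabulary.
ac = AhoCorasick(["leo tolstoy", "tolstoy", "war and peace"])
for end, pattern in ac.find_all("leo tolstoy wrote war and peace"):
    print(end, pattern)
```

The key property is that each character of input advances the automaton exactly once (plus amortized failure transitions), so adding more patterns to the vocabulary grows the trie but not the per-character matching cost.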

Urvin’s software development team built custom connectors to multiple data sources that ingest, extract and transform the data into a common intermediate format, which feeds the builds of our proprietary data structures. We then leveraged our novel data structures and algorithms to meet the client’s performance and accuracy requirements for both string matching and entity disambiguation/linking:

  • Our string matching performance was just over 500 nanoseconds per string, resulting in slightly less than 2 million matches per second.
  • Combining string matching and entity disambiguation / linking took around 2 microseconds per match-and-link, resulting in slightly less than 500 thousand match-and-links per second, with 91% accuracy.

Both benchmarks were achieved on a small hardware footprint that could easily be scaled horizontally.
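The benchmark figures above are two views of the same measurement: throughput is simply the reciprocal of per-item latency. A quick sanity check of the arithmetic, using illustrative latencies consistent with the quoted numbers (the exact measured values are not given in the text):

```python
def throughput_per_second(latency_seconds: float) -> float:
    """Items processed per second at a given per-item latency."""
    return 1.0 / latency_seconds

# Illustrative latencies consistent with the reported benchmarks:
# "just over 500 nanoseconds" and "around 2 microseconds".
match_latency = 520e-9
link_latency = 2.1e-6

# Slightly less than 2 million and 500 thousand per second, respectively.
print(f"string matches/sec:  {throughput_per_second(match_latency):,.0f}")
print(f"match-and-links/sec: {throughput_per_second(link_latency):,.0f}")
```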

The Result

There’s no better way to summarize how well our solution worked than to quote the client: “What I’ve seen is impressive. This is the first actual demonstration of viability that I’ve seen from any company that can take our data and work at our scale. This is remarkable speed and remarkable quality.” Urvin’s unique natural language technology combined with our in-depth understanding and knowledge of building AI at scale provided the solution this client needed, and the only viable solution they had seen – commercial or open source.
