Deduce: Distributed Dynamic Data Analytics Infrastructure for Collaborative Environments

Today, scientific research often depends on the ability to integrate data products and perform analyses across many data resources. These resources are collected and managed by a wide array of agencies, groups, and individuals, and they are continually updated and changed. New data integration frameworks are needed to help discover, track, and manage the data integration process, to keep integrated products synchronized with the distributed, dynamic data resources, and to schedule computational support for data integration.

Significantly improving users’ ability to build and maintain integrated data analytics and products over their lifecycle requires a holistic approach to developing a dynamic data integration framework. We address the end-to-end data integration challenges that scientists face, as well as key research challenges toward building next-generation data integration frameworks.

First, the underlying data resources are continually being updated with new data and changes to existing data, all of which need to be discovered and tracked. We need to incorporate knowledge of the target analytics and data products so that we can assess how sensitive each data analysis or data product is to data changes and additions. Second, we need metadata and provenance collection to build and track knowledge of the data. Third, we need user research to help define the interfaces that expose the distributed data resource management infrastructure to the end user. Finally, we need data management infrastructure and dynamic resource scheduling capabilities that can use knowledge of the data to meet the real-time constraints of the data integration tasks.
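As a rough illustration of the first requirement above (discovering changes to a data resource and relating them to the data products that depend on it), the following is a minimal sketch in Python. It is not Deduce's implementation or API: the `ChangeTracker` class, its method names, and the resource and product names are all hypothetical, and content hashing is only one assumed way to detect a change. The sketch flags every dependent product when a resource changes; a real sensitivity assessment would also weigh how much the change matters to each analysis.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Set


def fingerprint(content: bytes) -> str:
    """Return a content hash used to detect that a resource has changed."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class ChangeTracker:
    """Hypothetical tracker; names and structure are assumptions for illustration."""

    # Last fingerprint seen for each resource name.
    seen: Dict[str, str] = field(default_factory=dict)
    # Data products that depend on each resource.
    dependents: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, product: str, resources: List[str]) -> None:
        """Declare that a data product is built from the given resources."""
        for name in resources:
            self.dependents.setdefault(name, set()).add(product)

    def observe(self, resource: str, content: bytes) -> Set[str]:
        """Record the latest content and return products that may be stale.

        The first observation of a resource is treated as a baseline,
        so it never reports stale products.
        """
        digest = fingerprint(content)
        previous = self.seen.get(resource)
        self.seen[resource] = digest
        if previous is None or previous == digest:
            return set()
        return set(self.dependents.get(resource, set()))


# Usage with made-up resource and product names.
tracker = ChangeTracker()
tracker.register("annual_summary", ["station_a", "station_b"])
tracker.observe("station_a", b"reading-1")          # baseline, nothing stale
print(tracker.observe("station_a", b"reading-2"))   # prints {'annual_summary'}
```

Keeping the dependency map separate from the fingerprints means the same change record could later feed provenance collection or scheduling decisions, although how Deduce connects these pieces is not specified here.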

Deduce (Distributed Dynamic Data Analytics Infrastructure for Collaborative Environments) will address the capability gap between current dynamic distributed data resource infrastructure and end-user data integration needs to support data analyses and products.

Deduce provides the foundation needed to build efficient and effective dynamic distributed resource management infrastructure for data integration. The usage- and data-aware automated data integration infrastructure will directly impact scientists’ productivity. It will allow users to seamlessly access and process distributed, integrated data through data analysis pipelines, in support of scientific discovery.

This work is supported by the DOE Office of Science (Office of Advanced Scientific Computing Research).