Commit 4c3e22c

update mission (#28)
1 parent 97cf1a7 commit 4c3e22c

2 files changed, +13 -4 lines changed

content/home/mission.md

+8 -1
@@ -9,4 +9,11 @@ design:
 css_class: null
 ---
 
-SkyhookDM is an open source project to enable automatic mapping of data processing of structured data to heterogeneous architectures by providing a framework for efficient and composable data processing in storage and network layers. SkyhookDM leverages Apache Arrow and other open source projects that receive significant investment by the data management community in science and industry. The project strives to maximize contributions to existing open source projects while minimizing the size (and need for maintenance) of an independent codebase.
+Cultivate an ecosystem in which the open source software for the computational I/O stack can be developed, distributed, and sustained. This open source software must reduce barriers of adoption and meet the current and future challenges of the computational I/O stack, and the solutions should leverage the existing expertise outside storage and network I/O communities.
+
+## Goals
+
+- Foster collaboration around the open source data science, storage, and networking systems ecosystem
+- Support the development with system- and domain-specific computational I/O stack packages
+- Reduce barriers of adoption of open source software for the computational I/O stack
+

content/home/why.md

+5 -3
@@ -8,10 +8,12 @@ design:
 css_style: null
 css_class: null
 ---
-A key challenge in data science is extracting efficient and timely insights from an ever increasing flood of data streams. Apache Arrow, an open source data processing framework, provides an efficient and timely approach by reducing in-memory serialization and copy overheads. It is widely used in the development of data management services of structured data due to its interoperability across multiple programming languages and runtimes and plays an essential role in the rapid evolution of the open source data science ecosystem.
+The key advantage of the cloud is its elasticity. This is implemented by systems that can expand and shrink resources quickly and by disaggregation services, including compute, networking, and storage. Elasticity is also valuable for on-premise datacenters where disaggregation allows compute and storage to scale independently. This disaggregation however places greater demand on expensive top-of-rack networking resources since compute and storage nodes end up in different racks and even rows as the installation is growing. More network traffic also requires more CPU cycles to be dedicated to sending and receiving data. Therefore, disaggregation, somewhat paradoxically, amplifies the benefit of moving some compute – the compute that involves data management – into storage & network layers because data management filtering operations can reduce data movement significantly.
+
+Apache Arrow, an open source data processing framework, provides an efficient and timely approach by reducing in-memory serialization and copy overheads. It is widely used in the development of data management services of structured data due to its interoperability across multiple programming languages and runtimes and plays an essential role in the rapid evolution of the open source data science ecosystem.
 
 While many existing data management services that use Apache Arrow are well-suited for resource-rich environments, the open source data science ecosystem lacks a common framework for data management services designed for resource-constrained environments, like those found in the storage and network layers. This has led to a variety of insular and hard-to-reuse embedded data processing solutions. An efficient approach is to reduce data movement within the storage and network layers by embedding data reductive processing and caching throughout the data path. Like most emerging technologies, computational storage devices, smart NICs, and similar devices where data processing can be embedded have to overcome market entry barriers. Thus reducing complexity, increasing interoperability, and lower development costs are critical issues in embedded data management.
 
-SkyhookDM is a full-stack data management framework with the purpose of bridging the gap between resource-rich and resource-constrained environments to better serve data-intensive applications. By leveraging Apache Arrow in the storage and network layers, SkyhookDM adds extra data processing capabilities to embedded devices that were previously inflexible black boxes. This allows SkyhookDM to lower market entry barriers by saving costs and accelerating the development of data management services of structured data.
+SkyhookDM is an ecosystem of computational I/O stack components and building blocks that bridge the gap between resource-rich and resource-constrained environments to better serve data-intensive applications. By leveraging Apache Arrow in the storage and network layers, these components add extra data processing capabilities to embedded devices that were previously inflexible black boxes. This allows for lower market entry barriers by saving costs and accelerating the development of data management services of structured data.
 
-Beyond the integration and portability of data management services to embedded devices, SkyhookDM allows data management systems to interoperate with heterogeneous data management services, data sources, and devices in the storage and network layers. Sharing rich metadata from data management systems to embedded devices allows data management services to become smarter and better adapt to heterogeneous architectures. This allows truly distributed data management services to interact even more intelligently with their specific hardware and software contexts.
+Beyond the integration and portability of data management services to embedded devices, the SkyhookDM ecosystem allows data management systems to interoperate with heterogeneous data management services, data sources, and devices in the storage and network layers. Sharing rich metadata from data management systems to embedded devices allows data management services to become smarter and better adapt to heterogeneous architectures. This allows truly distributed data management services to interact even more intelligently with their specific hardware and software contexts.
