From Weeks to Seconds: The Potential of Distributed Compute for Raster Data

Introduction

Rasters are the unique snowflakes of the geospatial data world. That uniqueness demands a special style of data management to leverage their full potential. Failure to do so leads to friction and inefficiencies that hurt workflows and chip away at the viability of business cases.

In this article, we will dive into the status quo of raster data workflows, the challenges that come with it, and solutions for better data management that are on our horizon. But first, let’s set the stage by defining our key subject: raster data.

Understanding Raster Data and its Unique Complexities

Raster data is a type of geospatial data that provides a grid-based representation of spatial information, where each cell in the grid holds a value for one or more attributes. While this sounds simple enough, raster data gets complicated because its information, which covers the spherical Earth, has to be projected onto flat maps. Each pixel or grid cell corresponds to a specific location on Earth, and because the Earth is round, distortions of area, shape, and distance occur when representing these grids on a flat map.
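
To see what this looks like in practice, here is a minimal sketch that reads a raster with the open-source rasterio library and maps a pixel index to a real-world coordinate. The file name elevation.tif is a hypothetical placeholder, not a dataset from this article.

```python
import rasterio

with rasterio.open("elevation.tif") as src:
    band = src.read(1)          # cell values as a 2D NumPy array
    print(band.shape, src.crs)  # grid size and the map projection it is stored in

    # The affine transform ties pixel indices to projected coordinates,
    # so every cell corresponds to a real location on Earth.
    x, y = src.transform * (250, 100)  # (col, row) -> (x, y) in src.crs
    print(f"Cell at row 100, col 250 sits at ({x:.2f}, {y:.2f})")
```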

At first glance, raster data may seem similar to tabular data, since both arrange cells in a grid. On closer examination, they are worlds apart.

For example, a tabular vector dataset has rows and columns with a header, while raster data does not. The value of a vector table cell is not inherently tied to a location on Earth, whereas every raster cell is. And when two entries in a vector table sit next to each other, that says nothing about their physical proximity, whereas adjacent raster cells are, by definition, neighbors on the ground. These differences are the root cause of why raster data has to be managed differently from other tabular data varieties.

In essence, raster data is array-like and can't be managed in a table without severe tradeoffs, such as losing information about spatial relationships within the data or losing the ability to run high-performance computations on the content. The fact that raster data can't be properly managed and used with table-native storage makes dealing with this data type very challenging. After all, storing data in a tabular way is the default setting for the majority of people, and it simply doesn't work well for this data type.
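
A toy comparison makes the tradeoff concrete: in array form a spatial neighborhood is a single slice, while in a flattened table adjacency is no longer structural and must be reconstructed from row and column values.

```python
import numpy as np
import pandas as pd

raster = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 raster

# Array-native: the 3x3 neighborhood around cell (2, 2) is a single slice,
# and vectorized operations stay fast at any scale.
print(raster[1:4, 1:4].mean())

# Table-native: one row per cell. Adjacency is no longer structural and
# must be reconstructed by filtering on row/col indices.
table = pd.DataFrame(
    [(r, c, raster[r, c]) for r in range(5) for c in range(5)],
    columns=["row", "col", "value"],
)
mask = table["row"].between(1, 3) & table["col"].between(1, 3)
print(table.loc[mask, "value"].mean())  # same answer, far more machinery
```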

So, knowing that raster data is uniquely ill-suited for tabular handling, wouldn’t we expect it to get special treatment by the current data infrastructure paradigm? 

The Status Quo of Raster Data Processing

The current landscape of raster data processing is dominated by solutions optimized for vector or tabular data, leaving raster workloads either unsupported or suboptimal. Platforms like Databricks, BigQuery, Redshift, and Snowflake provide table engine solutions, with Databricks offering a data lakehouse optimized for parallel computation on tabular data. While Databricks has introduced a geopandas integration for vector data, its approach to raster data, the "raster frame" method, remains impractical and difficult to scale. This workaround, acknowledged by Databricks' Geospatial team, works for small-scale processing but is far from a proper solution.

Another group of competitors, such as Carto and Wherobots, have built specialized spatial data solutions on top of tabular storage and processing systems like PostgreSQL and Apache Spark. These systems work well for vector data but struggle with raster data, which cannot be sensibly mapped to a table.

Both Carto and Databricks agree that computing on raster data is insufficiently supported, even when using both platforms together. This further highlights the need for an optimized solution tailored to raster data.

Some organizations have turned to building custom solutions with open-source tools like PostgreSQL and GeoServer. However, this approach can be costly for smaller companies and complex for larger ones, leading to governance issues and the creation of inflexible legacy systems. As a result, many organizations eventually seek out more robust, off-the-shelf solutions.

Lastly, solutions like Google Earth Engine and Earth on AWS host EO data and enable analysis via batch jobs. While these platforms are effective for the EO data they natively host, the difficulty of preparing user-uploaded data makes them less appealing for enterprises. They also do not provide the flexibility needed for interactive data science on raster data in combination with other spatial types.

It is safe to say that all the existing solutions either fail to adequately support raster data or rely on impractical workarounds. Raster data requires a spatially-aware distribution and processing framework, which current technologies do not provide. 

What is really needed is a raster-native framework to complement existing table-native frameworks, providing an optimized, scalable, and flexible solution for raster data computation.

Distributed Computation of Raster Data

The problem with table-based engines is right there in the name: table. They are optimized for tabular data and are therefore poorly compatible with data that is not table-native, like raster data. To provide scalable, flexible, and speedy spatial data analysis, table engines need to be supplemented with a map engine.

A map engine supports array-native data and shards it geographically. When a Python command is issued, it can be executed both rapidly and in a geospatially aware fashion, because each node in the cluster only runs the command for the geographic section it has loaded. This simple but effective strategy supports use cases in which ever-changing spatial logic needs to be applied to large spatial datasets on the fly.
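
To make the idea concrete, here is a minimal sketch of geographic sharding in plain Python. This is not the API of any particular map engine: tile_raster and shard_stat are hypothetical names, and worker processes stand in for cluster nodes.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def tile_raster(band, tile_size):
    """Split a 2D array into row/column tiles (the geographic shards)."""
    rows, cols = band.shape
    for r in range(0, rows, tile_size):
        for c in range(0, cols, tile_size):
            yield (r, c), band[r:r + tile_size, c:c + tile_size]

def shard_stat(shard):
    """The per-node work: apply arbitrary logic to one geographic section."""
    (r, c), tile = shard
    return (r, c), float(tile.mean())

if __name__ == "__main__":
    band = np.random.rand(4096, 4096)      # stand-in for a large raster band
    shards = list(tile_raster(band, 1024))  # 16 geographic shards

    with ProcessPoolExecutor() as pool:     # one "node" per worker process
        results = dict(pool.map(shard_stat, shards))

    print(results[(0, 0)])  # result for the tile anchored at row 0, col 0
```

Because each shard is spatially contiguous, every worker can answer questions about its own patch of the map without shuffling data between nodes, which is where the speedup comes from.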

Below are some of the benefits of using a map engine for raster computation:

Reduced Processing Time — Rasters are distributed over multiple compute nodes for parallel analysis, creating results as fast as you need them. Particularly useful for running arbitrary logic on raster (and vector) data of any size.

Retained Spatial Awareness — Geography-aware distributed compute opens up the possibility of generating accurate risk assessments for spatially correlated assets and risks.

Data Visualization and Use — Analytics insights can be automatically visualized on a map and rendered into any downstream system or workflow, as sketched below.
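
As a sketch of that last step, the snippet below derives an index from two bands with rasterio and writes the result as a georeferenced GeoTIFF that any downstream map client can render in the right place. The file names and the NDVI example are illustrative, not part of any specific platform.

```python
import numpy as np
import rasterio

# Read two input bands; the profile carries the CRS, transform, and grid size.
with rasterio.open("red.tif") as red_src, rasterio.open("nir.tif") as nir_src:
    red = red_src.read(1).astype(float)
    nir = nir_src.read(1).astype(float)
    profile = red_src.profile

# Derive NDVI, guarding against division by zero.
denom = nir + red
ndvi = (nir - red) / np.where(denom == 0, 1, denom)

# Write the result with its georeferencing intact, so any map client,
# tile server, or downstream workflow can render it correctly.
profile.update(dtype="float32", count=1)
with rasterio.open("ndvi.tif", "w", **profile) as dst:
    dst.write(ndvi.astype("float32"), 1)
```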

Conclusion

The unique nature of raster data presents significant challenges when it comes to efficient processing and integration within table-native systems. Current solutions, while powerful, struggle with the characteristics of raster data because they are designed with table-based frameworks in mind. This leads to cumbersome workarounds and inefficiencies when trying to leverage these common solutions for raster data handling. To truly unlock the potential of distributed computation for raster data, there is a clear need for a raster-native framework: one that can handle large geospatial datasets with the same level of optimization and flexibility that table-native systems provide for tabular data. By adopting a map engine that supports array-native data, businesses can significantly reduce processing time, retain spatial awareness in their analyses, and seamlessly integrate raster data with broader workflows. With this shift, the full power of real-time, scalable raster computation can be realized, empowering more efficient and accurate geospatial analysis across industries.
