What Makes Spatial Data Special Data?
Data scientists work with a wide variety of data, and some of it likely includes street addresses or coordinates (e.g., latitude and longitude). However, most data scientists have not explored spatial data’s true capabilities (and complexities). Spatial data models offer a simplified view of reality that makes such data easier to work with. Let’s consider the two primary spatial data models, discrete and continuous, and learn when to use each. Discrete spatial data denotes known locations with a known boundary (such as the political border of the State of Tennessee). In contrast, continuous spatial data is estimated and does not have a known border (such as the region where the ocean temperature is 59 degrees).
Discrete data are stored as vectors (i.e., points, lines, or polygons). Point data can represent where a soil test sample originated, the precise location of study trees, or where an animal was tagged. Line data often represent streams, delivery routes, wildlife migration paths, or streets. Polygon data are closed shapes and frequently represent lakes, forests, or cities.
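To make the three vector types concrete, here is a minimal sketch that uses only the Python standard library. The GeoJSON-style dictionaries are an assumed representation chosen for illustration (the article does not prescribe one), and the names soil_sample, stream, and lake, along with all coordinates, are hypothetical values in an arbitrary planar coordinate system measured in metres.

```python
import math

# Hypothetical geometries in an arbitrary planar coordinate system (metres).
soil_sample = {"type": "Point", "coordinates": (2.0, 3.0)}

stream = {
    "type": "LineString",
    "coordinates": [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)],
}

lake = {
    "type": "Polygon",
    # One exterior ring; the first and last vertex coincide to close the shape.
    "coordinates": [[(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]],
}


def line_length(geom):
    """Sum of the straight segment lengths along a LineString."""
    pts = geom["coordinates"]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def polygon_area(geom):
    """Planar area of a Polygon's exterior ring via the shoelace formula."""
    ring = geom["coordinates"][0]
    twice_area = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )
    return abs(twice_area) / 2.0


print(line_length(stream))   # length of the hypothetical stream
print(polygon_area(lake))    # area of the hypothetical lake
```

A point carries only a location, a line adds length, and a polygon adds area, which is why the choice of vector type depends on what needs to be measured.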
Vector data are commonly stored as ESRI Shapefiles, each consisting of a .shp file and several sidecar files (e.g., .dbf, .shx, .prj) that must be kept together in the same folder. While the ESRI Shapefile is an old and somewhat outdated standard, most public data sources provide geospatial data in this format, so a data scientist is very likely to encounter it. Alternatively, vector data may be stored in proprietary ESRI File Geodatabases (.gdb). More recently, vector data have become available in an open, non-proprietary file type known as a GeoPackage (.gpkg). All of these file types can be read and explored with the Python library GeoPandas, which reads files through a GDAL-based backend (Fiona or, in recent releases, Pyogrio); GDAL is the Geospatial Data Abstraction Library.
While discrete data are intuitive, working with continuous data adds complexity. Continuous or thematic spatial data can include representations of noise pollution, terrain elevations, precipitation, or wind speed. Because data-collecting sensors cannot easily be placed on a perfect grid, values between the measured points must be estimated (interpolated) from the discrete samples. Continuous data are frequently represented using raster files. Much like a digital image where each pixel represents a color, the ‘pixels’ of raster files contain data representing values such as water temperature.
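One way to picture how values between measured points are estimated is inverse distance weighting (IDW), sketched below with NumPy. IDW is an assumption chosen for illustration, since the article does not name an interpolation method, and the function idw_grid, the four sensor locations, and their readings are all hypothetical.

```python
import numpy as np


def idw_grid(xy, values, grid_x, grid_y, power=2.0):
    """Estimate a value at every node of a regular grid by inverse
    distance weighting of the sensor readings."""
    gx, gy = np.meshgrid(grid_x, grid_y)                 # each (ny, nx)
    nodes = np.column_stack([gx.ravel(), gy.ravel()])    # (ny * nx, 2)
    # Distance from every grid node to every sensor: (n_nodes, n_sensors).
    dist = np.linalg.norm(nodes[:, None, :] - xy[None, :, :], axis=2)
    dist = np.maximum(dist, 1e-12)   # avoid dividing by zero on a sensor
    weights = 1.0 / dist ** power
    estimates = (weights @ values) / weights.sum(axis=1)
    return estimates.reshape(gy.shape)


# Hypothetical sensors at the corners of a 4 x 4 square, with readings.
sensor_xy = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
sensor_values = np.array([10.0, 12.0, 14.0, 16.0])

grid_x = np.linspace(0.0, 4.0, 5)
grid_y = np.linspace(0.0, 4.0, 5)
raster = idw_grid(sensor_xy, sensor_values, grid_x, grid_y)

# Row 0 corresponds to y = 0 here; image-style rasters usually put row 0
# at the top (largest y) instead.
print(raster.round(2))
```

Each cell of the resulting array plays the role of a raster pixel: nodes near a sensor take values close to its reading, while nodes equidistant from all sensors take their average.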
Common raster file types include Erdas Imagine files (consisting of an .img file and an .xml sidecar file), open and non-proprietary GeoPackage (.gpkg) geodatabases, and open-standard GeoTIFF (.tif) files. Public data sources often share raster data as Erdas Imagine files, while up-to-date satellite-based optical and radar imagery is now more frequently available in GeoTIFF format. Data scientists can explore these raster file types using the Rasterio Python library, which relies on GDAL.
We hope that this simplified overview encourages data scientists to go beyond the traditional bounds and focus on a new world of possibilities available through the exploration of spatial data. Once familiar with vector and raster data, data scientists can explore indoor mapping spatial file types (e.g., Apple Venue Format or Revit BIM), three-dimensional spatial files (COLLADA or Trimble SketchUp), or multitemporal spatial file formats (Network Common Data Form or Hierarchical Data Format).
Spatial data science is truly special data science.