Thursday, December 29, 2016

Processing Geospatial ShapeFile in Spark Part - 1

A geospatial shapefile is a file format for storing geospatial vector data. A shapefile consists of three mandatory files with the extensions .shp, .shx, and .dbf. Geographical features such as water wells, rivers, lakes, schools, cities, land parcels, and roads have a geographic location (latitude/longitude) and associated attributes such as name, area, or temperature. These features can be represented as points, lines, and polygons.
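The three mandatory files share a base name and differ only in extension, so a quick completeness check is possible before any processing. The following is a minimal sketch, not part of any shapefile library; the function name missing_components and its parameters are hypothetical.

```python
from pathlib import PurePath

# The three mandatory shapefile components named above.
MANDATORY_EXTENSIONS = {".shp", ".shx", ".dbf"}


def missing_components(filenames, base_name):
    """Return the mandatory extensions that are absent for `base_name`.

    `filenames` is any iterable of file names or paths (hypothetical input,
    for example the entries of an extracted archive).
    """
    present = {
        PurePath(name).suffix.lower()
        for name in filenames
        # Compare the name without its last extension, so that a sidecar
        # such as "x.shp.xml" (stem "x.shp") is not counted as "x.shp".
        if PurePath(name).stem == base_name
    }
    return sorted(MANDATORY_EXTENSIONS - present)
```

Calling missing_components(["cities/a.shp", "cities/a.dbf"], "a") reports that the .shx index is missing, while a complete set yields an empty list.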

Other Geo Data Formats

WKT - Well Known Text


The WKT representation of a rectangular polygon covering the San Francisco Bay Area is:

POLYGON((-122.84912109375 38.26487165882067,-121.7889404296875 38.26487165882067,-121.7889404296875 37.274872400526334,-122.84912109375 37.274872400526334,-122.84912109375 38.26487165882067))
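To make the structure of this WKT string concrete, here is a minimal sketch that parses a single-ring POLYGON into coordinate pairs and derives its bounding box. It uses only the Python standard library and handles only this simple case (no holes, no other geometry types). The function names are hypothetical. WKT lists each vertex as "x y", which for geographic data means longitude first, then latitude.

```python
import re


def parse_wkt_polygon(wkt):
    """Parse a single-ring WKT POLYGON into a list of (x, y) tuples."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("expected POLYGON((x y, ...)) with a single ring")
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()  # each vertex is "x y", i.e. "longitude latitude"
        ring.append((float(x), float(y)))
    if ring[0] != ring[-1]:
        raise ValueError("a polygon ring must end at its starting vertex")
    return ring


def bounding_box(ring):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) tuples."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


bay_area = (
    "POLYGON((-122.84912109375 38.26487165882067,"
    "-121.7889404296875 38.26487165882067,"
    "-121.7889404296875 37.274872400526334,"
    "-122.84912109375 37.274872400526334,"
    "-122.84912109375 38.26487165882067))"
)
ring = parse_wkt_polygon(bay_area)
print(len(ring), bounding_box(ring))
```

The ring has five vertices because the first vertex is repeated at the end to close the polygon.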

The polygon can be rendered on Google Maps with Wicket.


GeoJSON



Geo data can also be expressed in a JSON format known as GeoJSON. For example, a geographical location such as Coit Tower can be expressed in GeoJSON as below.

{
    "type": "Point",
    "coordinates": [
        -122.405802,
         37.802350
    ]
}
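GeoJSON stores a point's coordinates in longitude, latitude order, which is easy to swap by mistake. The sketch below reads the Point above with the standard json module and converts it to a WKT point. The function names are hypothetical and only the Point type is handled.

```python
import json


def point_to_lon_lat(geojson_text):
    """Return (longitude, latitude) from a GeoJSON Point geometry."""
    geometry = json.loads(geojson_text)
    if geometry.get("type") != "Point":
        raise ValueError("only GeoJSON Point geometries are handled here")
    lon, lat = geometry["coordinates"][:2]  # GeoJSON order: longitude, latitude
    return float(lon), float(lat)


def lon_lat_to_wkt(lon, lat):
    """Format a longitude/latitude pair as a WKT POINT (x = lon, y = lat)."""
    return f"POINT({lon} {lat})"


coit_tower = '{"type": "Point", "coordinates": [-122.405802, 37.802350]}'
lon, lat = point_to_lon_lat(coit_tower)
print(lon_lat_to_wkt(lon, lat))
```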

GeoJSON can be checked and viewed with a GeoJSON viewer such as geojsonlint.

ShapeFile



The shapefile for the SF Bay Area can be downloaded from sfgov.org.

Unzip the file:

[pooja@localhost Downloads]$ unzip bayarea_cities.zip
Archive:  bayarea_cities.zip
  inflating: bayarea_cities/bay_area_cities.dbf
  inflating: bayarea_cities/bay_area_cities.prj
  inflating: bayarea_cities/bay_area_cities.sbn
  inflating: bayarea_cities/bay_area_cities.sbx
  inflating: bayarea_cities/bay_area_cities.shp
  inflating: bayarea_cities/bay_area_cities.shp.xml
  inflating: bayarea_cities/bay_area_cities.shx
[pooja@localhost Downloads]$

The extracted files can be viewed with a shapefile viewer, such as the open source QGIS.
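Without a viewer, the .shp file can still be inspected programmatically. The sketch below decodes the fixed 100-byte header, assuming the layout given in the ESRI Shapefile Technical Description: a big-endian file code of 9994 at byte 0, the file length in 16-bit words at byte 24, then a little-endian version, shape type, and bounding box. The function name parse_shp_header is hypothetical, and only the five basic shape types are named.

```python
import struct

# Basic shape type codes from the ESRI Shapefile Technical Description.
SHAPE_TYPES = {0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}


def parse_shp_header(header):
    """Decode the first 100 bytes of a .shp file into a dictionary."""
    if len(header) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    (file_code,) = struct.unpack(">i", header[0:4])  # big-endian
    if file_code != 9994:
        raise ValueError("not a shapefile: unexpected file code")
    (length_words,) = struct.unpack(">i", header[24:28])  # big-endian, 16-bit words
    version, shape_code = struct.unpack("<ii", header[28:36])  # little-endian
    xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])  # little-endian
    return {
        "file_bytes": length_words * 2,
        "version": version,
        "shape_type": SHAPE_TYPES.get(shape_code, f"code {shape_code}"),
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

For the extracted data, the header could be read with open("bayarea_cities/bay_area_cities.shp", "rb").read(100) and passed to parse_shp_header.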

ShapeFile Transformation


Shapefile data can be easily converted into a PostgreSQL SQL file by tools such as shp2pgsql.

shp2pgsql <shapefile> <tablename> <db_name> > filename.sql 

For example:

shp2pgsql bay_area_cities.shp cities gisdatabase > cities.sql

A shapefile can be very large for some use cases, such as land parcels or census data. Such large data sets are usually divided (sliced) by some criterion, such as city, county, or state.
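The slicing idea can be sketched in a few lines: group feature records by the value of one attribute so that each slice can be stored or processed on its own. The records and the "city" field below are made up for illustration and are not taken from the Bay Area data set.

```python
from collections import defaultdict


def slice_by(records, key):
    """Group attribute records (dicts) by the value stored under `key`."""
    slices = defaultdict(list)
    for record in records:
        slices[record[key]].append(record)
    return dict(slices)


# Hypothetical land parcel attributes, as they might come from a .dbf table.
parcels = [
    {"parcel_id": 1, "city": "Oakland"},
    {"parcel_id": 2, "city": "San Jose"},
    {"parcel_id": 3, "city": "Oakland"},
]
for city, group in slice_by(parcels, "city").items():
    print(city, [p["parcel_id"] for p in group])
```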

2 comments:

  1. Hi there, excuse me, where is Spark considered here? This is only an explanation of what a shapefile is.

  2. I will publish part 2 soon, but basically the shapefile will be the input to the Spark API:

    val lines = sparkSession.sparkContext.newAPIHadoopFile(inputPath, classOf[PolygonFeatureInputFormat],
    classOf[LongWritable], classOf[PolygonFeatureWritable])
