A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data

Text of A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data

  1. A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data. Gloria Re Calegari and Irene Celino, CEFRIEL, Politecnico di Milano. Como, July 17th 2015. Free and Open Source Software for Geospatial - FOSS4G Europe 2015.
  2. Digital information about cities. Open data (large number of data sources available on the web): urban planning (land cover, public registers); demographics and statistics about municipalities. Closed data sources produced and maintained by enterprises: phone activity data (but sometimes made open!). User-generated information: volunteered geographic information and crowdsourced information (OpenStreetMap); location-based social networks (Foursquare check-ins and geolocated information). Real-time and streaming information: sensors (e.g. temperature, energy consumption, ...).
  3. Data exploration process and case study. A lot of data can describe the urban environment from different perspectives -> a great wealth for data scientists. Managing, processing and comparing those data can be cumbersome -> smarter solutions are required. Data exploration of heterogeneous urban information sources related to the city of Milano, Italy: possible issues; best practices; data exploration through correlation analysis (to understand whether diverse information sources mirror the same picture of a city).
  4. Milano datasets. Demographics: population density. Spatial resolution: census area (6079 census areas; median census-area size 12,000 m²). Source: Milano open data. Points of interest (POIs): transport, schools, sports facilities, amenities, shops, ... Spatial resolution: lat-long points. Source: Milano open data (official, 6718 POIs) and OpenStreetMap (user-generated, 44351 POIs).
  5. Milano datasets. Land use cover: type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined). CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html# Two types selected (those that best characterize a metropolitan area such as Milan): 1. residential; 2. agricultural. Spatial resolution: building level. Source: Lombardy region open data.
  6. Milano datasets. Call data records: 5 phone activities (incoming SMS, outgoing SMS, incoming calls, outgoing calls, Internet). Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013). Spatial resolution: grid of 3538 square cells of 250 m. Source: Telecom Italia, provided for their Big Data Challenge, http://theodi.fbk.eu/openbigdata/
  7. Challenges. Varying spatial resolution of the information sources (census area for population, single points for POIs, ...). Different time frames (population census done every 10 years, telecom data recorded every 10 minutes). Reliability (to what extent the sources can be trusted: data from public authorities versus data from crowdsourcing).
  8. Best practices adopted. 1) Data transformation, cleansing or normalization (standard operations). 2) Making the spatial resolution uniform. Spatial resolutions used: district level, with 88 official subdivisions; grid level, with 3538 square cells of 250 m. [Figure: maps of the cells and of the districts] New datasets generated: density of POIs in each cell/district; weighted sum of population density in each cell/district; percentage share of each land use over each cell/district area.
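The "density of POIs in each cell" dataset from slide 8 could be derived along the following lines. This is a minimal sketch, not the authors' procedure: it assumes the POI coordinates have already been projected to metres (the slides give lat-long points), and the names `cell_index`, `poi_density_per_cell`, `x0` and `y0` (the grid origin) are hypothetical.

```python
from collections import Counter

CELL_SIZE_M = 250.0  # cell side taken from the slides


def cell_index(x, y, x0, y0, cell_size=CELL_SIZE_M):
    """Return the (column, row) of the square cell containing point (x, y).

    Assumption: x, y, x0, y0 are projected coordinates in metres.
    """
    col = int((x - x0) // cell_size)
    row = int((y - y0) // cell_size)
    return col, row


def poi_density_per_cell(points, x0, y0, cell_size=CELL_SIZE_M):
    """Count the POIs falling in each cell and divide by the cell area (km^2)."""
    counts = Counter(cell_index(x, y, x0, y0, cell_size) for x, y in points)
    area_km2 = (cell_size / 1000.0) ** 2
    return {cell: n / area_km2 for cell, n in counts.items()}


# Toy example with four made-up points (not data from the study).
pois = [(10.0, 20.0), (240.0, 5.0), (260.0, 5.0), (600.0, 510.0)]
density = poi_density_per_cell(pois, x0=0.0, y0=0.0)
```

Cells without any POI are simply absent from the returned dictionary; a full pipeline would fill them with zero before comparing against the other datasets.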
  9. Best practices adopted. 3) Data compression (pre-processing large-scale time series to get a more manageable compressed representation). Telecom data: footprint/temporal signature for each cell/district (average activity over all the 60 days, distinguishing between weekdays and weekend days).
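The footprint described in slide 9 (one average value per 10-minute slot, computed separately for weekdays and weekend days) can be sketched as below. The record layout, a list of `(datetime, value)` pairs for a single cell and a single activity, and the function name `footprint` are assumptions made for illustration.

```python
from datetime import datetime

SLOTS_PER_DAY = 144  # one slot every 10 minutes, as in the slides


def footprint(records):
    """Average the activity per 10-minute slot, split into week and weekend.

    `records` is an iterable of (datetime, value) pairs (assumed layout).
    Slots that never occur in the input are returned as None.
    """
    sums = {kind: [0.0] * SLOTS_PER_DAY for kind in ("week", "weekend")}
    counts = {kind: [0] * SLOTS_PER_DAY for kind in ("week", "weekend")}
    for ts, value in records:
        kind = "weekend" if ts.weekday() >= 5 else "week"  # 5 = Sat, 6 = Sun
        slot = (ts.hour * 60 + ts.minute) // 10
        sums[kind][slot] += value
        counts[kind][slot] += 1
    return {
        kind: [s / c if c else None for s, c in zip(sums[kind], counts[kind])]
        for kind in sums
    }


# Toy example with made-up values: two weekdays and one Saturday at 08:00.
records = [
    (datetime(2013, 11, 4, 8, 0), 10.0),  # Monday
    (datetime(2013, 11, 5, 8, 0), 20.0),  # Tuesday
    (datetime(2013, 11, 2, 8, 0), 4.0),   # Saturday
]
fp = footprint(records)
```

Each cell or district is thus reduced from thousands of raw readings to two 144-value profiles.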
  10. Correlation analysis. Try to identify possible correspondences between different datasets. Measure whether and how two variables change together using correlation indexes -> Pearson's correlation coefficient r, with -1 <= r <= 1: r > 0 indicates positive correlation, r < 0 negative correlation.
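For reference, Pearson's coefficient for two equally long vectors x and y is the covariance of x and y divided by the product of their standard deviations. A minimal standard-library sketch follows; the function name `pearson_r` is illustrative, and in practice a library routine would normally be used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products; the 1/n factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Toy vectors (not data from the study).
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing relation
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly decreasing relation
```

In the setting of the slides, each vector would hold one value per cell or district, for example POI density against population density.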
  11. Correlation analysis - datasets. Pairwise comparisons between 1-dimensional vectors: POIs municipality: density. POIs OSM: density. Population: density. Telecom: first principal component, which explains 90% of the variability. Land use data: residential and agricultural used separately, as the percentage of the district/cell area belonging to each.
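Reducing each telecom footprint to a single number via the first principal component, as slide 11 describes, can be illustrated with a small power-iteration sketch. This is an assumption-laden illustration rather than the authors' implementation: the function name `first_principal_component` is hypothetical, the pure-Python loops are only practical for small inputs, and a real analysis would more likely rely on a numerical library.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (scores, explained_ratio) for the first principal component.

    `rows` is a list of equal-length numeric lists (one row per cell/district).
    The leading eigenvector of the covariance matrix is found by power
    iteration; the start vector is an arbitrary choice and would fail only
    if it happened to be orthogonal to that eigenvector.
    """
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data has no variance")
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return scores, eigenvalue / total_variance


# Toy example: perfectly collinear rows, so one component explains everything.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, explained = first_principal_component(rows)
```

The `scores` list is the 1-dimensional vector that would enter the pairwise correlations, and `explained` plays the role of the explained-variability share quoted in the slide.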
  12. Correlation analysis at district level. [Figure: correlation matrix over tlc, resid, agric, POI mun, POI OSM, pop] Correlation between Telecom and residential land use, and between Telecom and POIs, can actually exist. The data fit quasi-linear models. Negative correlation between agricultural land use and the other datasets -> human activity is inversely related to agricultural areas.
  13. Correlation analysis at cell level. [Figure: correlation matrix over tlc, resid, agric, POI mun, POI OSM, pop] All coefficients are lower than at the district level. Higher values again between Telecom and residential land use and between Telecom and POIs => the choice of resolution level can have a significant impact on the correlation results. Some phenomena causing the correlation are independent of the resolution level (0.76 residential-population).
  14. Correlation analysis: phone calls and population. Could the correlation change during the day according to the everyday human behaviour pattern (get up, go to work, come back home in the evening)? Call activity at 6 different times of day. [Figure: correlation profiles at district and cell level, for week and weekend] Week and weekend profiles are different -> mirroring people's different habits. Average correlation is higher at the weekend (phone activity related to the actual presence of people at home). Weekday profile -> human behaviour pattern.
  15. Conclusions and future work. To sum up: presentation of best practices for the data exploration process, applied to urban datasets of Milano; the approach is presented in an urban environment but can also be applied in different environments; correlation between different sources exists and is strongly related to the resolution level adopted. What is coming next? Extending our investigation toward a predictive approach: would it be possible to use one or more cheap datasets (like open data) as a proxy for more expensive data sources? Explorative analysis => statistical and machine learning techniques.
  16. Predictive analysis (not in the paper). Support Vector Machine to classify the CORINE classes using the POIs as predictors. Accuracy > 83%. Errors (black dots) on the boundary => promising results, go on in this direction!
  17. Thank you! Any questions? Gloria Re Calegari and Irene Celino, CEFRIEL, Politecnico di Milano.