Data Processing Methodology

This section gives an overview of the fundamental methodology that StreetLight Data uses to develop all Metrics. Each StreetLight InSight Metric also has its own methodological details, which can be shared with clients on request.

Step 1: ETL (Extract, Transform, and Load)

First, we pull data in bulk batches from our suppliers’ secure cloud environments. This can occur daily, weekly, or monthly, depending on the supplier. The data do not contain any personally identifying information. They have been de-identified by suppliers before they are obtained by StreetLight. StreetLight Data does not possess data that contains any personally identifying information.

The ETL process not only pulls the data from one environment securely to another, but also eliminates corrupted or spurious points, reorganizes data, and indexes it for faster retrieval and more efficient storage.
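
As a rough illustration of the "eliminate, reorganize, and index" idea, the sketch below drops malformed or out-of-range points and groups the rest by device in time order. The record fields (device_id, t, lat, lon) and the function name are hypothetical and chosen only for this example; the source does not describe the actual record layout or storage format.

```python
from collections import defaultdict


def clean_and_index(records):
    """Drop unusable points and index the rest by device, sorted by time.

    `records` is an iterable of dicts with hypothetical keys:
    device_id, t (seconds), lat, lon.
    """
    index = defaultdict(list)
    for rec in records:
        try:
            device = rec["device_id"]
            t = float(rec["t"])
            lat = float(rec["lat"])
            lon = float(rec["lon"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric field: treat as corrupted
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue  # coordinates outside the valid range: treat as spurious
        index[device].append((t, lat, lon))
    for pings in index.values():
        pings.sort()  # time order makes later per-device scans cheap
    return dict(index)
```

Grouping by device up front means later steps can scan one device's points at a time instead of searching the whole batch.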

Step 2: Data Cleaning and Quality Assurance

After the ETL process, we run several automated, rigorous quality assurance tests to establish key parameters of the data. To give a few examples, we conduct tests to:

  • Verify that the volume of data has not changed unexpectedly,
  • Ensure the data is properly geolocated,
  • Confirm the data shares similar patterns to the previous batch of data from that particular supplier.
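
The first two checks in the list above can be sketched in a few lines. This is a minimal illustration, not the tests StreetLight actually runs: the 25% tolerance, the bounding-box representation, and both function names are assumptions made for the example.

```python
def volume_within_tolerance(previous_count, current_count, tolerance=0.25):
    """True if the batch size changed by at most `tolerance` (a fraction)."""
    if previous_count <= 0:
        return False  # no usable baseline, so treat the change as unexpected
    return abs(current_count - previous_count) / previous_count <= tolerance


def share_in_bounds(points, bounds):
    """Fraction of (lat, lon) points inside bounds.

    `bounds` is (min_lat, min_lon, max_lat, max_lon), a hypothetical
    stand-in for the expected coverage area.
    """
    if not points:
        return 0.0
    min_lat, min_lon, max_lat, max_lon = bounds
    inside = sum(
        1 for lat, lon in points
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    )
    return inside / len(points)
```

A batch could then be held for manual review when the volume check fails or when the in-bounds share falls below an agreed level.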

In addition, StreetLight staff visually and manually reviews key statistics about each data set. If anomalies or flaws are found, the data are reviewed by StreetLight in detail. Any concerns are escalated to our suppliers for further discussion.

Step 3: Create Trips and Activities

For any type of data supply, the next step is to group the data into key patterns. For example, for navigation-GPS data, a series of data points that begins early in the morning, moves at reasonable speeds for a number of minutes, and then stands still for several minutes could be grouped into a probable “trip.” For LBS data, we follow a similar approach. However, since LBS devices continue to ping while at the destination, we see clusters of pings in close proximity at the beginnings and ends of trips.
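
The "moves, then stands still" pattern can be illustrated with a simple speed-and-dwell rule. The sketch below is only one possible way to group points into trips, assuming points are already projected to meters; the thresholds (1 m/s, 300 seconds) and the function name are hypothetical and do not come from StreetLight.

```python
import math


def split_into_trips(pings, min_speed_mps=1.0, min_dwell_s=300):
    """Group time-ordered pings into trips.

    `pings` is a list of (t_seconds, x_m, y_m). A trip starts at the first
    moving segment and ends once the device has been still for at least
    `min_dwell_s`. Returns a list of (start_t, end_t) tuples.
    """
    trips = []
    trip_start = None   # start time of the trip in progress, if any
    last_move_t = None  # end time of the most recent moving segment
    for (t0, x0, y0), (t1, x1, y1) in zip(pings, pings[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        if speed >= min_speed_mps:
            if trip_start is None:
                trip_start = t0
            last_move_t = t1
        elif trip_start is not None and t1 - last_move_t >= min_dwell_s:
            trips.append((trip_start, last_move_t))  # long stop closes the trip
            trip_start = None
    if trip_start is not None:
        trips.append((trip_start, last_move_t))  # data ended mid-trip
    return trips
```

Short pauses, such as a stop at a traffic light, stay inside one trip because they are shorter than the dwell threshold.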

We also employ a machine-learning algorithm to detect bicycle and pedestrian trips within the LBS data. These trips can then be analyzed separately. See more information in our Active Mode Methodology, Data Sources and Validation paper. 

Step 4: Contextualize

Next, StreetLight integrates other “contextual” data sets to add richness and improve the accuracy of the mobile data. These include road networks (with information such as speed limits and directionality), land use data, parcel data, census data, and more.

For example, a “trip” from a navigation-GPS or LBS device is a series of connected dots. If the traveler turns a corner but the device is only pinging every 10 seconds, then that intersection might be “missed” when all the device’s pings are connected to form a complete trip. StreetLight uses road network information, including speed limits and directionality, to “lock” the trip to the road network. Bicycle and pedestrian trips may also be locked to other networks such as trails and sidewalks, but not interstates. This “locking” process ensures that the complete route of the vehicle is represented, even though discrepancies in ping frequency may occur. Figure 1, below, illustrates this process.

Figure 1: “Unlocked” Trips becoming locked trips.
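
To make the idea of “locking” concrete, the sketch below snaps a single point to the nearest road segment by geometric projection. This is a deliberately simplified stand-in: it ignores speed limits, directionality, and route continuity, all of which the text says are used. Coordinates are assumed to be planar (for example, meters), and both function names are hypothetical.

```python
import math


def project_onto_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (all (x, y))."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return a  # degenerate segment: both ends coincide
    # Fraction along the segment of the perpendicular foot, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_network(p, segments):
    """Snap p to the closest point on any segment in `segments`."""
    candidates = (project_onto_segment(p, a, b) for a, b in segments)
    return min(candidates, key=lambda q: math.hypot(q[0] - p[0], q[1] - p[1]))
```

Snapping each point independently can jump between parallel roads, which is one reason a production approach would also weigh direction of travel and plausible speeds.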



As another example, if a device that creates LBS data regularly pings on a block with residential land use, and those pings often occur overnight, there is a high probability that the device owner lives on that block/block group. This allows us to associate “home-based” trips and a “likely home location” with that device. In addition, we can append the distribution of income and other demographics for residents of that census block to that device. That device can then “carry” that distribution everywhere else it goes. (Our demographic data sources for the US are the Census and American Community Surveys. In Canada, our source is Manifold Data.) This allows us to normalize the LBS sample to the population, and to add richness to analytics of travelers such as trip purpose and demographics.
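
The home-location reasoning above can be sketched as a count of overnight pings per residential block. The night window (10 p.m. to 6 a.m.), the minimum ping count, the tuple layout, and the function name are all assumptions made for illustration; the source does not state the actual rules.

```python
from collections import Counter


def likely_home_block(pings, night_start=22, night_end=6, min_night_pings=3):
    """Return the block with the most overnight residential pings, or None.

    `pings` is an iterable of (hour_of_day, block_id, land_use) tuples.
    """
    counts = Counter(
        block
        for hour, block, land_use in pings
        if land_use == "residential" and (hour >= night_start or hour < night_end)
    )
    if not counts:
        return None
    block, n = counts.most_common(1)[0]
    # Require repeated overnight presence before calling it a likely home
    return block if n >= min_night_pings else None
```

Once a device has a likely home block, that block's demographic distribution can be attached to the device as described above.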

Step 5: More Quality Assurance

After patterns and context are established, additional automatic quality assurance tests are conducted to flag patterns that appear suspicious or unusual. For example, if a trip appears to start at 50 miles per hour in the middle of a four-lane highway, that start is flagged as “bad.” Flagged trips and activities are not deleted from databases altogether, but they are filtered out from StreetLight InSight queries and Metrics.
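
The "flag, but do not delete" behavior can be illustrated as follows. The 20 mph cutoff, the dictionary fields, and the function names are hypothetical; the source only gives the 50 mph highway example and does not state a threshold.

```python
def flag_suspicious_starts(trips, max_start_speed_mph=20.0):
    """Return copies of the trips with a boolean 'flagged' field added.

    No trip is removed; a trip is flagged when its first observed speed
    exceeds `max_start_speed_mph`.
    """
    return [
        dict(trip, flagged=trip["start_speed_mph"] > max_start_speed_mph)
        for trip in trips
    ]


def queryable(trips):
    """Trips that remain visible to queries: everything not flagged."""
    return [trip for trip in trips if not trip["flagged"]]
```

Keeping flagged trips in storage means a rule can later be revised and re-applied without re-ingesting the data.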

Step 6: Normalize

Next, the data is normalized along several different parameters to create the StreetLight Index. As all data suppliers change their sample size regularly (usually increasing it), monthly normalization occurs.

For LBS devices, we perform a population-level normalization for each month of data. For each census block, StreetLight counts the devices in the sample that appear to live there and computes the ratio of that count to the total population reported to live there. A device from a census block that has 1,000 residents and 200 StreetLight devices will be scaled differently everywhere in comparison to a device from a census block that has 1,000 residents and 500 StreetLight devices. Thus, the StreetLight Index for LBS data is normalized to adjust for any population sampling bias. It is not yet “expanded” to estimate the actual flow of travel.
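
As a worked example of the two census blocks mentioned above, one simple way to scale devices is to weight each device by the number of residents it represents. The exact formula StreetLight uses is not given in this article, so the residents-per-device weight below is an assumption, as is the function name.

```python
def device_weight(block_population, block_devices):
    """Residents represented by each sampled device in a census block."""
    if block_devices <= 0:
        raise ValueError("block has no sampled devices")
    return block_population / block_devices


# The two blocks from the text: same population, different sample sizes
sparse_block = device_weight(1000, 200)  # 5.0 residents per device
dense_block = device_weight(1000, 500)   # 2.0 residents per device
```

Under this assumption, a trip made by a device from the sparsely sampled block counts 2.5 times as much as a trip made by a device from the densely sampled block, which offsets the uneven sampling.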

For navigation-GPS trips, StreetLight uses a set of public loop counters at certain highway locations to measure the change in trip activity each month. It then compares this ratio to the corresponding ratio of StreetLight trips at the same locations, and normalizes appropriately. In addition, StreetLight systematically performs adjustments to best estimate total, normalized trips based on external calibration points. Such calibration points include public, high-quality vehicle count sensors (for example, those in PeMS systems, or the TMAS repository) as well as reports from surveys and other externally validated sources. Thus, the StreetLight Index for GPS data is normalized to adjust for change in our sample size. It is not normalized for population sampling bias (because we cannot infer home blocks for GPS data). This is one of the reasons we recommend LBS data for all personal travel analytics. The StreetLight Index for GPS data is not yet “expanded” to estimate the actual flow of travel.
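
The ratio comparison described above might look like the following. This is a sketch under stated assumptions: a single baseline month, a single location, and a correction factor defined as counter change divided by sample change. None of these details are specified in the article, and the function name is hypothetical.

```python
def sample_growth_factor(counter_base, counter_now, sample_base, sample_now):
    """Factor that removes sample-size growth from observed trip counts.

    counter_*: loop-counter volumes in the baseline and current month.
    sample_*:  observed sample trips at the same location in those months.
    """
    if min(counter_base, counter_now, sample_base, sample_now) <= 0:
        raise ValueError("all counts must be positive")
    counter_ratio = counter_now / counter_base  # real change in traffic
    sample_ratio = sample_now / sample_base     # change seen in the sample
    return counter_ratio / sample_ratio


# Traffic was flat, but the sample doubled: scale current trips by 0.5
factor = sample_growth_factor(10_000, 10_000, 500, 1_000)
normalized_trips = 1_000 * factor
```

With flat counter volumes and a doubled sample, the normalized value returns to the baseline level of 500, so the growth in sample size is not mistaken for growth in travel.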

Step 7: Store Clean Data in Secure Data Repository

After being made into patterns, checked for quality assurance, normalized, and contextualized, the data is stored in a proprietary format. This enables extremely efficient responses to queries via the StreetLight InSight platform. By the time the data reaches this step, it takes up less than 5% of the initial space of the data before ETL. However, no information has been lost, and contextual richness has been added.

Step 8: Aggregate in Response to Queries

Whenever a user runs a Metric query via StreetLight InSight, our platform automatically pulls the relevant trips from the data repository and aggregates the results. For example, if a user wants to know the share of trips from Origin Zone A to Destination Zone B vs. Destination Zone C during September 2017, they specify these parameters in StreetLight InSight. Trips that originated in Origin Zone A and ended in either Destination Zone B or Destination Zone C during September 2017 will be pulled from the data repository, aggregated appropriately, and organized into the desired Metrics.
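
The Origin Zone A example can be sketched as a filter followed by a weighted share calculation. The trip fields (origin, dest, month, weight) and the function name are hypothetical; the actual repository format is proprietary and not described here.

```python
from collections import defaultdict


def destination_shares(trips, origin, destinations, month):
    """Share of matching trips ending in each destination zone.

    `trips` is an iterable of dicts with hypothetical keys: origin, dest,
    month, and an optional normalization weight (default 1.0).
    """
    totals = defaultdict(float)
    for trip in trips:
        if (
            trip["origin"] == origin
            and trip["month"] == month
            and trip["dest"] in destinations
        ):
            totals[trip["dest"]] += trip.get("weight", 1.0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {dest: 0.0 for dest in destinations}
    return {dest: totals[dest] / grand_total for dest in destinations}
```

Only the aggregated shares leave the function, which matches the principle that results describe group behavior rather than individual trips.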

Results always describe aggregate behavior, never the behavior of individuals.

Step 9: Final Metric Quality Assurance

Before delivering results to the user, final Metric quality assurance steps are automatically performed. First, StreetLight InSight determines whether the analysis zones are appropriate. If a Zone is a nonviable polygon shape, lies outside of the coverage area (for example, in an ocean), or is too small (for example, analyzing trips that end at a single household), the Zone will be flagged for review. If a Metric returns a result with too few trips or activities to be statistically valid or to protect privacy, the result will be flagged. When results are flagged, StreetLight’s support team personally reviews the results to determine if they are appropriate to deliver from a statistical/privacy perspective. The support team then personally discusses the best next steps with the user.
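
The checks described above can be sketched as a function that collects the reasons a result should go to manual review. The minimum area and minimum trip count are placeholder values chosen for illustration, and the function name and parameters are hypothetical; the article does not state the actual thresholds.

```python
def review_reasons(zone_area_sq_m, in_coverage, trip_count,
                   min_area_sq_m=10_000.0, min_trips=30):
    """Return a list of reasons to flag a result; empty means deliverable."""
    reasons = []
    if zone_area_sq_m <= 0:
        reasons.append("nonviable zone shape")
    elif zone_area_sq_m < min_area_sq_m:
        reasons.append("zone too small")
    if not in_coverage:
        reasons.append("zone outside coverage area")
    if trip_count < min_trips:
        reasons.append("too few trips or activities")
    return reasons
```

Returning every reason, rather than stopping at the first, gives a reviewer the full picture before contacting the user.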

In general, StreetLight InSight response time varies according to the size and complexity of the user’s query. Some runs take two seconds. Some take two minutes. Some take several hours. Users receive email notifications when longer projects are complete, and they can also monitor progress within StreetLight InSight. Results can be viewed as interactive maps and charts within the platform, or downloaded as CSV and shapefiles to be used in other tools.
