Why Singletrack Today?
I love to mountain bike. But, as much as I love mountain biking, I don't ride when the trails are wet - it damages the trails.
Walnut Creek is a system of popular mountain biking trails in Austin, Texas. It's where I ride most often and it's so popular that "is Austin dry?" became a meme on a local biking message board.
That message board died out but the question persisted and discussion moved to a Facebook group.
Clearly I wasn't the only one who wanted to know "is Walnut dry?", so I decided that answering this question would be the perfect basis for a fun project.
Specifically, I was looking for a project that met the following criteria:
- Data Science
I had recently completed a couple of data science and machine learning MOOCs and wanted a project that would help reinforce what I'd learned and push my boundaries.
With the MOOCs we worked on static data, results were confined to our Python IDE, and code only ran when we opened the IDE and executed the code.
I wanted a project that would ingest data and update predictions automatically and share those predictions with the world.
- Serverless & Cheap
Believing the future is serverless, I definitely wanted a serverless architecture. I love the idea of paying only for the compute you use. For a small project like this it's a perfect fit, and I knew a serverless architecture would keep the cost crazy-low.
First I wrote an AWS Lambda function in Python to collect weather and creek level data from OpenWeatherMap and USGS Waterservices, open a .csv file stored in AWS S3, and append the new data as a new row at the bottom of the .csv. An AWS CloudWatch trigger invokes the Lambda function each hour.
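Sketched in code, that collection step might look something like this. The bucket name, CSV key, API URLs, and response field names are all placeholders, not the project's real values:

```python
# Minimal sketch of the hourly collection Lambda. Everything AWS- or
# API-specific here is a hypothetical stand-in.
import csv
import io

WEATHER_URL = "https://api.openweathermap.org/data/2.5/weather?..."  # placeholder
CREEK_URL = "https://waterservices.usgs.gov/nwis/iv/?..."            # placeholder


def append_row(csv_text: str, row: list) -> str:
    """Append one observation as a new row at the bottom of the CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows.append([str(v) for v in row])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def handler(event, context):
    """Hourly entry point, invoked by the CloudWatch trigger."""
    import json
    import urllib.request
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "singletrack-data", "walnut.csv"  # hypothetical names

    # Pull current weather and creek level (field names are assumptions).
    with urllib.request.urlopen(WEATHER_URL) as r:
        weather = json.load(r)
    with urllib.request.urlopen(CREEK_URL) as r:
        creek = json.load(r)
    row = [weather["dt"], weather["main"]["temp"],
           weather.get("rain", {}).get("1h", 0), creek["value"]]

    # Read the CSV from S3, append the new row, and write it back.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    s3.put_object(Bucket=bucket, Key=key, Body=append_row(body, row).encode())
```

Reading the whole file and rewriting it is crude but perfectly serviceable at one row per hour.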
I let that function run for a year. I could have sped this up had I paid for historical weather data but
- the hourly weather data is free
- I had other projects to keep me busy in the interim
- one of the project goals is "cheap".
Ok, so now I had a year's worth of hourly weather data. But you can't train a trail condition prediction model with weather data only. You need a proper training set that includes corresponding trail condition scores for each hour.
So I headed to the aforementioned Facebook group to tap into the wisdom of my fellow mountain bikers. In the hours and days following a rainstorm, people post to Facebook to provide or request trail condition reports. Here's an example:
There were two shortcomings with the Facebook data - (1) it wasn't hourly like my weather data and (2) it was qualitative - but each of these shortcomings was easily overcome.
First, I devised a 100-point scale to quantify the wetness of the trails. The higher the number, the wetter the trail; anything over 70 meant riders should stay home to avoid trail damage.
Then I manually assigned scores to each hour. There was some judgment involved: I had to interpret the qualitative post data and translate it to a numeric score.
I also had to fill in the blanks between posts, but this wasn't too hard. For each hour I calculated one-day, five-day, 10-day, and 20-day rainfall totals. These totals served as a good reference when filling in the blanks.
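Those trailing totals are easy to compute with pandas rolling windows. A minimal sketch, assuming an hourly `rain` column (the column name and units are my assumptions, not the project's):

```python
import pandas as pd


def add_rain_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Add trailing rainfall totals over several windows to hourly data.

    Assumes one row per hour and a 'rain' column holding that hour's
    rainfall; a one-day total is therefore a 24-row rolling sum.
    """
    out = df.copy()
    for days in (1, 5, 10, 20):
        out[f"rain_{days}d"] = out["rain"].rolling(days * 24, min_periods=1).sum()
    return out
```

`min_periods=1` lets the early rows produce partial totals instead of NaN, which keeps the first days of data usable.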
It was tedious work. I was so happy when I got to June. Between June and September Austin gets very little rain, has lots of sun, and has high temperatures so the trails are always "gtg" (Facebook shorthand for "good to go"). A couple hours later the training set was good to go.
Model Training & Deployment
The next step was training the model. I did the model training on Google Colab. I first tried an XGBoost model, my go-to model, but it was too good: when I assigned scores to the training set I used increments of five (e.g. scores of 60, 65, 70, 75, etc.) and the XGBoost model's predictions snapped to those same increments.
Even though my training scores were discrete, I wanted a model that would help fill in the gaps by outputting continuous scores. So I tried a Random Forest model. Mission accomplished: the Random Forest model produced continuous output scores, and the test set showed it to be more accurate as well.
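A sketch of that training step with scikit-learn. The feature and target column names here are assumptions for illustration; the real notebook's features and hyperparameters will differ:

```python
# Hedged sketch of the Colab training step; column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def train_model(df: pd.DataFrame, features: list, target: str = "wetness"):
    """Fit a Random Forest on the hourly training set and report test R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print(f"test R^2: {model.score(X_test, y_test):.3f}")
    return model
```

Because a Random Forest averages the targets of many leaves, its predictions land between the discrete training scores, which is exactly the gap-filling behavior described above.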
With the model trained I "pickled" the model and uploaded the pickled model to AWS S3. I edited my Lambda function so that it would not only pull and save the hourly data but also apply the pickled model to the hourly data to predict trail conditions.
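The pickle round trip itself is a few lines; a sketch, with the S3 side shown only in comments since the bucket and key names are hypothetical:

```python
import pickle


def serialize_model(model) -> bytes:
    """Pickle a trained model into bytes suitable for an S3 object body."""
    return pickle.dumps(model)


def load_model(blob: bytes):
    """Reverse of serialize_model: rebuild the model from pickled bytes."""
    return pickle.loads(blob)


# Inside the Lambda the flow would look roughly like this
# (bucket/key are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   blob = s3.get_object(Bucket="singletrack-data", Key="model.pkl")["Body"].read()
#   model = load_model(blob)
#   score = model.predict(latest_features)[0]
```

One caveat worth noting: unpickling requires the same scikit-learn version in the Lambda environment as in Colab, so the library versions need to be pinned.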
Now that my Lambda was generating predictions, I needed to figure out how to share them with the world. For this I created a simple WordPress site, converted it to a static site with the Simply Static WordPress plugin, and uploaded the static site to S3 (configuring Route 53 and CloudFront so I could use a custom domain name).
Next, I updated the Lambda function again so that each hour when the Lambda is triggered, weather data is collected and saved, predictions are made, and index.html is edited (using Beautiful Soup) to display the updated weather data and trail conditions.
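A sketch of that page-update step. The element id and the score formatting are assumptions about the page, not the site's real markup:

```python
# Hedged sketch of the Beautiful Soup edit; the "trail-score" id is hypothetical.
from bs4 import BeautifulSoup


def update_score(html: str, score: float) -> str:
    """Rewrite the trail-condition element in index.html with a fresh score."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id="trail-score")  # hypothetical element id
    verdict = "stay home" if score > 70 else "good to go"
    tag.string = f"{score:.0f} / 100 - {verdict}"
    return str(soup)
```

The Lambda would then `put_object` the updated HTML back to the static-site bucket, and CloudFront serves the new page on the next cache refresh.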
Check it out at Singletrack.today
I'm feeling pretty good about the project. I've got a working app, I increased my knowledge of machine learning and serverless deployments, and I did it on a budget (the only thing I've paid for is the domain name - all of the AWS compute and hosting has been within free tier limits).
That said, I'd love to see the project grow and evolve. Some ideas for the future:
Additional Trails: Right now the project is limited to Walnut Creek, but I'd love to see it expand to include trails the world over.
Better Input Data: I think I did a pretty good job of creating a training set from the Facebook post data. But if I can convince people to volunteer as trail stewards and provide hourly, numerical trail condition scores (I'm thinking Google Forms would be a quick, easy way to capture this data), then the Walnut model can be improved and models for additional trails can be added at scale.
Trail Condition Forecasts: The model predicts current trail conditions. But what if you want to know if the trails will be ready to ride tomorrow? Forecast weather data could be used to predict future trail conditions. It'd be cool to have a trail condition forecast for the coming week so users can plan in advance.
Automatic CSV Cleanup: Every time new data is appended to the .csv file, it becomes slower to open and save, so the Lambda function takes longer to execute. This isn't a problem now: a health check on my Lambda function notifies me when execution slows down, and it only takes a few minutes to slim down the working .csv by moving old data to another .csv for long-term storage. And the project is well within the free tier limits. But if additional trails and/or trail condition forecasts are added, this could become a more pressing issue.
A Real Database: On that topic, if additional trails and/or trail condition forecasts are added, it might make sense to add a real database. I opted for a .csv rather than a real database to keep things simple, cheap, and serverless.
Version Control and Continuous Deployments: I'm manually deploying new code to Lambda from my laptop. If this grows to include multiple trails and multiple contributors it will become important to implement a system for version control and continuous deployments.
If you'd like to learn more about Singletrack Today, get involved with Singletrack Today, or go for a ride, please contact me using the comment form below or the contact form on the homepage.