I’d wanted to get started in the machine learning space for a while, but was a bit intimidated by not being a “data scientist”. Then, at a recent meetup, I was introduced to a new AWS service. Enter AWS SageMaker AutoPilot, which claims to do the heavy lifting for you.


The AutoPilot lifecycle

The Spiel

Prior to writing this, my personal ML experience was effectively zero, so this article will hopefully help you if you have an ML experiment that you would like to try out but don’t know where to start. By the end of this, you should have a deployed machine learning algorithm that can be easily invoked via a REST endpoint. At the time of writing, AutoPilot is still in preview, so these steps are subject to change.

The Context

A recent Tigerspike innovation day where we partnered with a transport provider and AWS led to an opportunity to get my hands dirty with ML. The hypothesis:

“Can we leverage available data on bus occupancy to better inform users’ travel patterns by predicting how full a bus will be when it reaches their stop?”

The Input Data

For me, this meant checking out publicly available data on OpenData. For you, this means pulling together a CSV of data that includes both the column you want to predict and the parameters that may affect it. In my example, I’m interested in the “Capacity Bucket” field as an output to represent how full the bus was.

Top Tips: The first time I did this, I included every field in the data set. This didn’t turn out too well: it confused the model with irrelevant inputs and increased the time it took to generate the model exponentially. If you know a column isn’t relevant, delete it. Also, ensure your data is correctly marked as a string where needed (encapsulate the values with quotes), as this helps the model determine what is a categorical value vs a numerical one. In the example below, I tag the bus route as categorical by using quotation marks to inform SageMaker that there is a set of bus routes, rather than 372 being a numerical value that relates to the bus.


Input data for my model on bus capacity. Note the categorisation of ROUTE and TRANSIT_STOP
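To make this concrete, here’s an illustrative excerpt in the same shape as my input file. The ROUTE, TRANSIT_STOP and CAPACITY_BUCKET column names come from my data set, but every value (and the TIMESTAMP column) is a made-up placeholder, and your own columns will differ. Note the quotes around ROUTE and TRANSIT_STOP:

```csv
TIMESTAMP,"ROUTE","TRANSIT_STOP","CAPACITY_BUCKET"
2019-11-04T07:45:00,"372","203311","MANY_SEATS_AVAILABLE"
2019-11-04T08:10:00,"372","203311","FEW_SEATS_AVAILABLE"
2019-11-04T08:35:00,"370","201234","STANDING_ROOM_ONLY"
```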

Getting started with SageMaker AutoPilot

Now for the fun bit! Log into AWS and go to AWS SageMaker. You should see a link for AWS SageMaker Studio. Add a user if you don’t have one already (you can use the default values) and, once that is complete, click “Open Studio” to launch JupyterLab.


AWS SageMaker Studio’s Control Panel once a user has been added

This will launch a welcome screen where you can get straight into an AutoPilot experiment. Create a new experiment from there to begin.


This is where you set up your experiment and where AutoPilot shines. It makes a whole heap of assumptions about your data, so you only need to provide it with a few details (there’s also a boto3 sketch of the same setup further below):

Experiment name: Come up with something cool, but bear in mind if you take this further you’ll have many of these experiments before you settle on your model. Think ahead.

S3 location of input data: I’ll not go into this step in too much detail. Go and create an S3 bucket and upload your CSV file(s) to it (ensure the name aligns with the permissions of the user you set up). If you set up your user correctly, you shouldn’t need to change anything from the default settings of the bucket.

Target attribute name: This is the column name that you are trying to predict.

S3 location for output data: You can create a separate bucket or use a folder within your first bucket here. This is where your actual model will be stored.

Machine learning problem type: None of these options meant anything to me as a data science newbie. Luckily, AutoPilot swoops in and offers to work it out for you, so you can leave this as “Auto”.

Complete Experiment: As I wanted to eventually deploy this model, let’s go with a full experiment.


My setup for AutoPilot, keeping it as simple as possible
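If you’d rather script the same setup than click through it, the equivalent boto3 call looks roughly like the sketch below. The job name, bucket paths, column name and role ARN are placeholders to swap for your own, and omitting ProblemType is the scripted equivalent of leaving it as “Auto”:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# All names below are placeholders; use your own job name, buckets and role.
sagemaker.create_auto_ml_job(
    AutoMLJobName="bus-capacity-experiment",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    # S3 location of input data: wherever you uploaded the CSV
                    "S3Uri": "s3://my-input-bucket/bus-data/",
                }
            },
            # Target attribute name: the column you are trying to predict
            "TargetAttributeName": "CAPACITY_BUCKET",
        }
    ],
    # S3 location for output data: where the generated models will be stored
    OutputDataConfig={"S3OutputPath": "s3://my-input-bucket/output/"},
    # No ProblemType here, so AutoPilot works it out for itself ("Auto")
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
)
```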

The Magic

There’s a bit of a waiting game next while AutoPilot does all of the hard work. If you have trimmed your columns down effectively, it shouldn’t take too long.

Top Tip: What you’ll notice once the first “Analysing Data” step is complete is that two buttons appear in the top-right of the window. It isn’t required, but I’d highly recommend reading through the “Data exploration notebook” for some tips on how to clean up your data and improve your model. This is how I realised that I had to mark certain columns as categorical. It’s a great way to kill time while you are waiting for AWS to turn you into a data scientist.


If you right-click on your experiment and select “Describe AutoML Job” you can see the trials that are being run on your behalf and which is currently seen as the “best.” If you are impatient, you can right-click on an individual trial and deploy it. Otherwise, sit tight until the work is done and deploy the best model at the end.
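You can do the same check from code if you prefer. A minimal sketch, assuming the placeholder job name from earlier:

```python
import boto3

sagemaker = boto3.client("sagemaker")

job = sagemaker.describe_auto_ml_job(AutoMLJobName="bus-capacity-experiment")
print(job["AutoMLJobStatus"], job["AutoMLJobSecondaryStatus"])

# BestCandidate only shows up once at least one trial has completed
best = job.get("BestCandidate")
if best:
    metric = best["FinalAutoMLJobObjectiveMetric"]
    print(best["CandidateName"], metric["MetricName"], metric["Value"])
```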

I’d encourage digging into the various levels of depth within your selected model to understand more about it, but that’s probably another article. Right-click and “Deploy model” to move forward.

On theme, you can keep the next screen pretty simple. Give the endpoint a name and choose an instance type. If you are just playing around or are cost-conscious, pick the lowest instance type you can. Hit “Deploy model” when you are done.


Deployment settings. As you can see, there’s not much to it.
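For completeness, those console steps map onto three boto3 calls if you ever want to deploy from code instead. This is just a rough sketch with placeholder names; the containers for the model come from the winning candidate of your AutoML job:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Look up the best candidate from the AutoML job (placeholder job name)
job = sagemaker.describe_auto_ml_job(AutoMLJobName="bus-capacity-experiment")
best = job["BestCandidate"]

# Register the winning candidate's containers as a model
sagemaker.create_model(
    ModelName="bus-capacity-best-model",
    Containers=best["InferenceContainers"],
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
)

# Describe how the endpoint should be hosted (pick a small instance if cost-conscious)
sagemaker.create_endpoint_config(
    EndpointConfigName="bus-capacity-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "bus-capacity-best-model",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the endpoint itself; it takes a few minutes to reach InService
sagemaker.create_endpoint(
    EndpointName="bus-capacity-endpoint",
    EndpointConfigName="bus-capacity-config",
)
```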

Testing it out using API Gateway and Lambda

OK. It’s time to start trying your model out. Unfortunately, SageMaker doesn’t give you a very easily invokable endpoint. You’ll end up with something like the below, but you’ll need the appropriate access to invoke it. If you are calling it via the AWS CLI or an SDK and this isn’t a problem, then skip ahead. If you are like me and just want to call it from Postman without worrying about authentication, then read on.


The endpoint screen once deployed

I won’t go into too much detail on API Gateway and Lambda as that’s not the scope of this article. Having said that, following these quick steps should get you going:

  1. Go to Lambda in the AWS Console and click Create Function.
  2. Give it a name and select Python as the runtime.
  3. Choose “Create a new role with basic Lambda permissions” under Execution Role.
  4. On the next screen, scroll down to “Execution Role” and open your new role in IAM. Attach the AmazonSageMakerReadOnly policy, and also grant it sagemaker:InvokeEndpoint permissions using the policy snippet below so it can actually invoke your new endpoint.
  5. Add the code snippet below as the content of your function.
  6. Add the environment variable ENDPOINT_NAME with the name (not ARN) of your deployed model. That’s “opal-trimmed” in my case.
  7. Hit Save and you’re done.
  8. Go to API Gateway in the AWS Console and click Create API.
  9. Choose REST API and click Build.
  10. Give it a name, then click Create.
  11. From the Actions menu, select Create Method and select POST.
  12. Select your previously created Lambda function (I’d tick “Use Lambda Proxy integration” so the request body is passed straight through to the function) and hit Save.
  13. From the Actions menu, select Deploy API, select a stage and hit Deploy.
  14. You should now have a URL to invoke!


The snippet below simply invokes your endpoint with the CSV data that you provide it.
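A minimal sketch of that function, assuming the Lambda proxy integration mentioned in step 12 so the raw CSV row arrives in event["body"] (adjust that line if your integration differs):

```python
import os

import boto3

# SageMaker runtime client used to call the deployed endpoint
runtime = boto3.client("sagemaker-runtime")

# Step 6: the endpoint name (not ARN) comes from an environment variable
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]


def lambda_handler(event, context):
    # With Lambda proxy integration, the raw request body is in event["body"];
    # it should be a single CSV row, minus the target column
    payload = event["body"]

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )

    # The model's prediction comes back as plain text
    prediction = response["Body"].read().decode("utf-8")

    return {"statusCode": 200, "body": prediction}
```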


Below is an example of the policy that your Lambda will need to invoke your deployed model.
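Something along these lines should work; the wildcard Resource is just for illustration, and you can scope it down to your endpoint’s ARN if you prefer:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sagemaker:InvokeEndpoint",
      "Resource": "arn:aws:sagemaker:*:*:endpoint/*"
    }
  ]
}
```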

Testing the model

Now that it is deployed and has been wrapped with an open API Gateway endpoint, let’s hit it! Open up your preferred client (I’m using Postman) and create a POST request to your new endpoint. The body will be similar to a row of your training data CSV, excluding the column that you were trying to predict. Here’s an example in the shape of my data set.
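Using the same placeholder columns as the earlier excerpt (your real columns will differ), the body is just a single CSV row with the CAPACITY_BUCKET target left out:

```csv
2019-11-04T07:45:00,"372","203311"
```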


The model in action!

Invoke the API and you will see the model’s predicted value returned! There you have it: you’ve just created and deployed a machine learning algorithm that you can use for anything you like.

Next Steps

OK, that seemed reasonably easy thanks to AWS SageMaker AutoPilot doing the majority of the heavy lifting. The challenge is that we are left with a model that is probably not all that accurate. AutoPilot probably made some decisions that benefitted its overall accuracy over actually being smart. In my example, my output field had only three categorical values representing a bus’ capacity, rather than a numerical scale that would record minor increments and could result in a better reading. I ended up with the unfortunate scenario where my model sees the “Many Seats Available” case as so common that it can guarantee a reasonable level of accuracy just by always returning it.

The idea of an ML model is to improve over iterations and I feel with a lot of the hard work being done by AWS, it’s a bit less scary to take that jump into the first iteration.

Hopefully, you found this article useful in getting you started on what could be a game-changing feature of the already game-changing AWS SageMaker service. Maybe it’s even made you want to get your feet wet with machine learning and all of the awesome use cases that come with it!

Karim Jlil

Senior Technical Lead,
Sydney
