
From Notebook to Production: Building a Serverless Spam Classifier with Scikit-Learn and AWS

2026-05-02 17:48:02

Introduction

Spam has evolved from a mere nuisance into a serious cybersecurity threat. To protect users, developers increasingly rely on machine learning to create intelligent filters that can tell legitimate emails apart from malicious ones. While crafting a model in a Jupyter notebook is straightforward, the true challenge lies in deploying that model into a scalable, production‑ready system that end users can interact with in real time.

Source: www.freecodecamp.org

In this project, I built an end‑to‑end serverless spam classifier, combining Scikit‑learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, cost‑effective API that classifies messages instantly. The system is modular: the model can be retrained and updated independently without affecting the live API. From spotting “free iPhone” scams to detecting phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real‑world deployment.

Prerequisites

Before diving in, ensure you have the following:

  - An AWS account with permission to create S3 buckets, Lambda functions, and API Gateway resources
  - Python 3.x with scikit-learn and joblib installed
  - The AWS CLI installed and configured with your credentials
  - Basic familiarity with supervised machine learning

Building the Brain: The Model

At the heart of this project lies a supervised learning approach. Instead of manually defining spam rules, we let the algorithm learn patterns from a labeled dataset.

Vectorization: Turning Text into Numbers

Machine learning models cannot read raw text; they require numerical input. To solve this, we use the TF‑IDF (Term Frequency – Inverse Document Frequency) vectorizer. The code snippet below shows how to apply it:

from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

The mathematical formula behind TF‑IDF is:

$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$

Where:

  - $w_{i,j}$ is the weight of term $i$ in document $j$
  - $tf_{i,j}$ is the frequency of term $i$ in document $j$
  - $N$ is the total number of documents in the corpus
  - $df_i$ is the number of documents that contain term $i$

After vectorization, we train a classifier (e.g., Logistic Regression or Naive Bayes) on the transformed features. The trained model is then saved using joblib for deployment.
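The training step described above might look like the following sketch. The toy messages and labels here are illustrative placeholders; in practice `X_train` and `y_train` come from your labeled dataset split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

# Toy labeled data for illustration; replace with your real dataset split
X_train = [
    "Congratulations, you won a free iPhone! Click now",
    "Urgent: verify your account to avoid suspension",
    "Meeting moved to 3pm, see you in the conference room",
    "Here are the notes from yesterday's call",
]
y_train = ["spam", "spam", "ham", "ham"]

# Vectorize the raw text into TF-IDF features
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

# Train the classifier on the transformed features
model = LogisticRegression()
model.fit(X_train_features, y_train)

# Persist both artifacts for deployment
joblib.dump(model, "model.pkl")
joblib.dump(feature_extraction, "vectorizer.pkl")
```

Note that the vectorizer is saved alongside the model: at inference time, incoming messages must be transformed with the exact same fitted vocabulary.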

Deploying the Model to AWS

Deployment involves three main AWS services: S3 (storage), Lambda (compute), and API Gateway (entry point).

Step 1: Upload the Model to S3

First, save your trained model and the TF‑IDF vectorizer as .pkl files using joblib. Then upload them to an S3 bucket:

aws s3 cp model.pkl s3://your-bucket/spam-classifier/
aws s3 cp vectorizer.pkl s3://your-bucket/spam-classifier/

Step 2: Create a Lambda Function

Create a new Lambda function (Python 3.11 runtime). The function will:

  1. Load the model and vectorizer from S3 when the function starts (cold start).
  2. Receive an incoming message via the event object.
  3. Vectorize the message using the loaded TF‑IDF transformer.
  4. Use the loaded classifier to predict “spam” or “ham”.
  5. Return the prediction as a JSON response.

Make sure to give the Lambda role permission to read from S3 and to write logs to CloudWatch.


Step 3: Set Up API Gateway

Create a REST API in API Gateway. Add a POST endpoint (e.g., /classify) and integrate it with your Lambda function. Enable CORS if you plan to call the API from a web frontend. Deploy the API to a stage (e.g., prod) – you’ll get a public URL that can accept requests.
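Once deployed, the endpoint can be exercised with a simple POST request. The invoke URL below is a placeholder; substitute the one API Gateway shows for your stage:

```shell
# Hypothetical invoke URL; use the one from your API Gateway stage
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"message": "Congratulations, you won a free iPhone!"}' \
  https://abc123.execute-api.us-east-1.amazonaws.com/prod/classify
```

A successful call returns a JSON body with the predicted label, such as {"prediction": "spam"}.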

How to Run the Project Locally

To test the pipeline before deploying to AWS, you can run everything on your local machine:

  1. Clone the project repository.
  2. Install the required packages: pip install -r requirements.txt.
  3. Run the training script to generate the model.pkl and vectorizer.pkl files.
  4. Test the inference script by passing a sample email (e.g., “Congratulations, you won a free iPhone!”).

Once you’re satisfied, follow the deployment steps above to move to AWS. For detailed instructions, refer to the Deployment section.
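A minimal local inference script for step 4 might look like this, assuming the training script has already produced the two .pkl files:

```python
import joblib

def classify(message, model_path="model.pkl", vectorizer_path="vectorizer.pkl"):
    """Load the saved artifacts and return the predicted label for one message."""
    model = joblib.load(model_path)
    vectorizer = joblib.load(vectorizer_path)
    # The message must pass through the same fitted vectorizer used in training
    return model.predict(vectorizer.transform([message]))[0]

# Example: classify("Congratulations, you won a free iPhone!")
```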

Our Project Architecture

The architecture is completely serverless and event‑driven:

  1. A client sends a POST request containing the message to the API Gateway endpoint.
  2. API Gateway forwards the request to the Lambda function.
  3. Lambda loads the model and vectorizer from S3 (on cold start), vectorizes the message, and runs the prediction.
  4. The prediction is returned to the client as a JSON response.

This design is highly scalable: Lambda automatically handles concurrent requests, and you only pay for the compute time used. The model can be updated simply by uploading new .pkl files to S3 – no code changes needed.

Conclusion: The Power of Serverless AI

This project shows how to take a machine learning model from a local notebook to a live, production‑grade API with zero server management. By combining Scikit‑learn with AWS serverless services, you get a spam classifier that is:

  - Scalable – Lambda handles concurrent requests automatically
  - Cost‑effective – you pay only for the compute time actually used
  - Modular – the model can be retrained and swapped by uploading new .pkl files to S3, with no code changes

Whether you’re building a spam filter, a sentiment analyzer, or any other classification system, the same architecture can be reused. The gap between experimentation and production is smaller than you think – serverless AI makes it possible.

For the complete code and a ready‑to‑use model, visit the project repository on GitHub.
