Linear Regression: Code (a) Line

It's time to write your first ML model and predict house prices. To follow along, go ahead and take a look at the complete product: https://github.com/yotambelgoroski/ml_unchained-house_pricing

Step 1: It's all about data

ML is all about data: you can't create a model without training it, and you can't train it without data. Our dataset is typically split into two parts:

- Training data - Data used to train a model
- Test data - Once a model is trained, we can take input (x) from the test data, predict the output (ŷ), and compare that prediction to the real value (y). This tells us how well our model performs.

In more advanced setups, you might also see a validation set, which is used to tune the model before testing it.

Where does data come from?

The answer depends on your business and use case. For learning purposes, Kaggle is a great source for datasets and ML resources. To keep things simple, I use a script that generates synthetic data.

How much data do I need for training?

There is no fixed number: as model complexity increases, more data is required. A common rule of thumb is:

- Have 10×–20× more data points than features (independent variables)

We currently have one feature (sqm), so I used 10 records to train the model, the bare minimum to keep things simple.

How much data do I need for testing?

There are several approaches, but a simple one is to split your dataset using an 80:20 ratio:

- 80% for training
- 20% for testing

Step 2: Training the model

Now that we have our dataset, it's time to train a model.
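Before training, it may help to see how the synthetic data and the 80:20 split from Step 1 could look in code. This is only a sketch, not the script from the repository: the function names `make_synthetic_data` and `split_dataset`, the pricing formula, the noise level, and the 50-row dataset size are all assumptions made for illustration. Only the `sqm` and `price` column names come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_synthetic_data(n_rows: int, seed: int = 42) -> pd.DataFrame:
    """Hypothetical generator: price grows linearly with sqm, plus random noise."""
    rng = np.random.default_rng(seed)
    sqm = rng.uniform(40, 200, size=n_rows).round(1)
    noise = rng.normal(0, 10_000, size=n_rows)
    # Assumed relationship, chosen only for illustration.
    price = (50_000 + 3_000 * sqm + noise).round(2)
    return pd.DataFrame({"sqm": sqm, "price": price})


def split_dataset(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Shuffle the rows and split them into training and test sets (80:20 by default)."""
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)
    return train_df, test_df


df = make_synthetic_data(50)
train_df, test_df = split_dataset(df)
print(f"{len(train_df)} training rows, {len(test_df)} test rows")
```

With 50 rows, this yields 40 training records and 10 test records. Fixing `random_state` keeps the split reproducible between runs.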
Training involves three steps:

1. Load the training data
2. Train the model in memory based on that data
3. Serialization: save the trained model to disk so it can be reused without retraining

Here is how it looks in code:

    import joblib
    import pandas as pd
    from pathlib import Path
    from sklearn.linear_model import LinearRegression

    FEATURE_COLS = ["sqm"]
    TARGET_COL = "price"
    MODEL_FILENAME = "house_price_model.joblib"


    def load_training_data(train_path: Path) -> pd.DataFrame:
        # Step 1: read the training CSV into a DataFrame
        return pd.read_csv(train_path)


    def train_model(df: pd.DataFrame) -> LinearRegression:
        # Step 2: fit a linear model that maps the features (sqm) to the target (price)
        model = LinearRegression()
        model.fit(df[FEATURE_COLS], df[TARGET_COL])
        return model


    def save_model(model: LinearRegression, dest_path: Path) -> None:
        # Step 3: serialize the trained model, creating the target directory if needed
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(model, dest_path)
        print(f"Model saved → {dest_path}")


    def train(train_path: Path, model_dir: Path) -> LinearRegression:
        # Run all three steps: load, train, save
        df = load_training_data(train_path)
        model = train_model(df)
        save_model(model, model_dir / MODEL_FILENAME)
        print(f"Model trained on {len(df)} samples.")
        return model

That's it: our first model!

Our Dependencies

- Pandas - A data handling library for working with tabular data. Its core structure, the DataFrame, allows us to easily access and manipulate data.
- scikit-learn - A machine learning library for Python. LinearRegression is one of its models, used to learn the best linear relationship between input features and a target value.
- Joblib - A utility library used here for serialization. It allows us to save a trained model to disk and load it later for inference.

Congratulations, you've created your first model! However, it's not production-ready yet. Next, we'll use the test data to evaluate how good our model really is.
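Because the model is saved to disk, a later script can load it and predict a price without retraining. The sketch below illustrates that round trip under some assumptions: `load_model` and `predict_price` are hypothetical helpers that do not appear in the article, and the toy data follows a made-up exact formula (price = 50,000 + 2,000 × sqm) so the result is easy to check by hand.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURE_COLS = ["sqm"]
TARGET_COL = "price"
MODEL_FILENAME = "house_price_model.joblib"


def load_model(model_path: Path) -> LinearRegression:
    """Hypothetical helper: deserialize a model previously saved with joblib.dump."""
    return joblib.load(model_path)


def predict_price(model: LinearRegression, sqm: float) -> float:
    """Hypothetical helper: predict the price of a single house from its size."""
    # Use the same column name as in training so the feature names match.
    features = pd.DataFrame({"sqm": [sqm]})
    return float(model.predict(features)[0])


# Toy data following an assumed exact line: price = 50_000 + 2_000 * sqm
toy_df = pd.DataFrame(
    {"sqm": [50, 80, 100, 120], "price": [150_000, 210_000, 250_000, 290_000]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / MODEL_FILENAME
    # Train and save, mirroring the training and serialization steps above.
    trained = LinearRegression().fit(toy_df[FEATURE_COLS], toy_df[TARGET_COL])
    joblib.dump(trained, model_path)
    # Load the model back from disk and use it for inference.
    restored = load_model(model_path)
    predicted = predict_price(restored, 90.0)

print(f"Predicted price for 90 sqm: {predicted:,.0f}")
```

Since the toy data lies exactly on a line, the prediction for 90 sqm should come out at roughly 230,000. Real data is noisy, which is why the model needs to be evaluated against the test set.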
