Malignant or Benign?

Working on Machine Learning as a Complete Beginner

Sofia Sanchez
11 min read · Mar 14, 2021

A few weeks ago, I decided to start my journey in Artificial Intelligence. Not because I’m a techy person, but because I understand the disruptive potential that it has in any given field.

Now, there was a huge challenge that I had to overcome: I didn’t like programming because I’m not good at it. I know that everybody says that, but the truth is that, in the past, I couldn’t even complete a simple Python course. Yes, that’s how much of a starter I was.

So I know the frustration that a lot of beginners may feel. That’s why, in this article, my goal is to break down the most important concepts for those who are just starting out, want to create Artificial Intelligence themselves, but don’t want to get too deep into the math involved.

Since this guide is more practice-oriented than theory-oriented, if you are a beginner, I encourage you to check out this resource that I’ve created to learn the basics of Artificial Intelligence.

Our ML Toolbox

This section is meant for us to understand the big picture of what will be happening once we get into the coding part.

The first project that I chose to work on uses a classification-type algorithm. What this means is that I’ll be talking about using Machine Learning to make predictions.

6 Main steps

Let’s start right away. The 6 main steps that we’ll be following, from beginning to end, will be:

  1. Import all the data that we’ll work with (for training and testing)
  2. Clean/prepare that data: eliminate stray blank spaces, duplicated values, and data that we don’t need, and turn most elements into numerical values so our computers can understand them
  3. Split the data into training and testing sets: we want to make sure that our model is trained enough, but we also want enough data left over to verify that our model truly works
  4. Create a model: this simply means choosing the algorithm that we’ll use to make predictions, classify, or cluster data. There are tons of algorithms out there to choose from
  5. Train & make predictions: this is where the training data comes into play. We use the model to process some inputs and give some outputs in response
  6. Evaluate & tune: we want the predictions to be as accurate as possible, so we’ll measure the accuracy of our model’s predictions and make some adjustments

ML libraries

If you’re a beginner like me, you may have heard about “libraries” but you’re not sure what they refer to. Well, in simple words, a programming library is a collection of code that can be used in different contexts. That way, we don’t have to write the code again.

Many libraries are useful to us in Machine Learning. Some of the most popular include NumPy, Scikit-Learn, pandas, and matplotlib.
Don’t ask me about the weird names, but… this is what each of them offers (there’s a quick combined example after the list):

  • NumPy: it allows us to work with arrays, which are faster than lists. People use it for trigonometry, statistics, or algebra stuff
  • Scikit-Learn: it contains tools for all sorts of ML models. For instance, it’s useful for classification, regression, clustering, and dimensionality reduction
  • Pandas: the most widely used library for data science. It allows importing and manipulating data from various file formats
  • Matplotlib: a plotting library. This means that it’ll give us a visual representation of data
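To make those weird names a bit more concrete, here’s a tiny combined sketch. The numbers are made up purely for illustration, and this isn’t part of the project itself:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

ages = np.array([20, 23, 25, 26, 29, 30])  # NumPy: fast numerical arrays
print(ages.mean())                         # simple statistics
df = pd.DataFrame({'age': ages})           # pandas: tabular data handling
print(df.describe())
plt.hist(df['age'])                        # Matplotlib: a visual representation of the data
plt.show()
model = DecisionTreeClassifier()           # Scikit-Learn: ready-made ML models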

ML Models

Another fancy word. The way I like to think about it, a model is like a machine that follows a specific set of instructions to process inputs and produce outputs.

In other words, a model uses a mathematical algorithm to process the information you give it and produce an output.

There are 3 main kinds of things that we’d like to do with AI: classify, cluster, or predict. From my perspective, these tasks do help in understanding models, but they’re not the same thing as the models themselves. Let me explain…

There are actually many different kinds of mathematical models that we can use for AI. Some examples are logistic regression, decision trees, or naïve Bayes.

Sometimes, you can solve the same problem with more than one model. What you’ll really want to keep an eye on is accuracy.

As I was mentioning at the beginning, my goal with this project is not to become a Machine Learning developer. It’s only to build something that works. However, I can recommend some resources if you want to learn more about the mathematical models.

The challenge

Just for a bit more context, diagnosing cancer with Machine Learning is not the most difficult thing in the world these days. On the contrary, it’s quite common among online courses, and even YouTube videos. The problem is that the data can get complex. So before doing that project, I started with a simpler one.

For the first project, I’ll be explaining each of the 6 main steps in a detailed way, going through the code syntax and the tiny details. For the second project, I’ll just talk about the aspects that change.

This will hopefully help us all — including myself — understand what the code means.

Simple project

We know that Artificial Intelligence is data-hungry. This means that the more data it can get, the better it will perform. However, from a beginner’s perspective, it may actually be better to start simple.

In this first project, I used a Machine Learning model called a Decision Tree Classifier to predict what kind of music a person would like, based on their age and gender.

0. Data and notebooks

One very important thing to notice is that the data that I worked with was kind of fictional, meaning that I didn’t actually survey real people about their music preferences. This doesn’t affect our results though, since the data is consistent. It looks something like this:

It’s literally this simple 😅

In this case, there’s not a lot to explain about the data, except for the gender column: males are 1s and females are 0s.

Being a complete beginner, I needed to decide where — meaning on what platform — I would create my projects. The most recommended options that I found were Jupyter Notebook and Google Colab.

Because I already use Google Drive for a lot of things, I decided to go for Colab. So far, I can definitely recommend it :)

1. Import the data

If you remember, the first (formal) step to create an ML model is to import the data. As part of our toolbox, we have a tool called Pandas which can help us do exactly that. We can invite it to our coding party like this:

import pandas as pd

After doing that, we’ll be able to use it to open our file by using the read_csv method and the name of the file as an argument. For simplicity’s sake, we’ll save that to a variable called df, which stands for data frame. We now have:

import pandas as pd
df = pd.read_csv('music.csv')
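By the way, if you want to double-check what pandas just loaded, a couple of handy inspection calls (this is purely optional exploration):

print(df.head())  # the first five rows of the table
print(df.shape)   # (number of rows, number of columns)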

2. Prepare the data

I like to think of Machine Learning models as high-performing athletes. You’re normally preparing them to do difficult stuff, so you need to make sure that they’re getting enough training and that they’re on the right diet.

In technical terms, the diet is actually our data. We clearly want to make sure that it’s easy for our model to digest. So the next thing we’ll do is split it into dependent and independent variables.

For this specific project, age and gender are the variables that determine what music genre a person likes. Therefore, we can infer that our dependent variable is the genre.

As always in math, we will represent the independent variables with an X, and the dependent variables with a y.

To separate the values, we will use a pandas method called drop. The result will be one table with the age and gender values (X), and another one with the genre values (y).

X = df.drop(columns=['genre'])
y = df['genre']

3. Training and testing sets

Continuing with the athlete analogy, we want them to get enough training so they come prepared for the marathon, but don’t burn out before it.

For this ML project, since we only have a limited amount of data, we need to divide the data into training and testing sets. Every expert will (almost) always recommend using 80% of the data for training and the remaining 20% for testing.

To do this, we first import a function from sklearn’s model_selection module, and then use train_test_split to split the data accordingly. The 0.2 means 20% of the data will be reserved for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
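One small detail worth knowing: train_test_split shuffles the data randomly, so every run produces a different split. If you want reproducible results, you can pass a random_state; this is optional, and not something the original code does:

# Same 80/20 split, but reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # roughly 80% vs 20% of the rows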

4. Choose a model

As I was saying before, there are many types of ML models that we can use, depending on what we want to do. In this case, we will be using a Decision Tree Classifier, which in my opinion, is one of the simplest to understand.

The algorithm will go through a series of questions, narrowing down what the answer could be, like a flowchart of yes/no questions, except that it’s automated and related to music.

5. Train & make predictions

Once we know which algorithm we want to use, we need to figure out which library offers it. Whatever algorithm you’re thinking of, it’s very likely that Scikit-Learn offers it.

That said, the next steps are importing a new library and then assigning a DecisionTreeClassifier instance to a variable. This is very simple:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

It’s time for training. If our model is an athlete, we want it to be… fit!
So we’ll actually be using the fit method on our model now. It’s important to make sure that we’re using the training data sets.

model.fit(X_train, y_train)
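If you’re curious what that “series of questions” from step 4 actually looks like once the model is trained, scikit-learn can print the learned tree as text. A small optional sketch, assuming our two columns are named age and gender (the exact splits will depend on your data):

from sklearn.tree import export_text
# Prints the if/else questions the tree learned, one per line
print(export_text(model, feature_names=['age', 'gender']))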

6. Test & tune

Enough training. Time for the marathon… making predictions is now possible with the predict method and the testing data set! 🙌

predictions = model.predict(X_test)
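We can also ask the model about a single, hypothetical person. For example, a 21-year-old male (gender = 1); this input is completely made up:

# Predict the favorite genre of one made-up person: age 21, gender 1 (male)
print(model.predict([[21, 1]]))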

In ML, results are not enough. We need to measure the accuracy score of the predictions. The accuracy_score function will help us do this easily. It compares the predictions we just made with the actual values from the table.

from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

For this small project, we won’t have a lot of tuning. The accuracy score will vary between 50% and 100%. Considering that we have around 20 rows of information, that sounds fair. In a more professional environment, there would be thousands of rows and columns to train and test our model.
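Since the split is random and the data set is tiny, one optional way to get a fairer picture (not part of the original tutorial) is to repeat the split several times and average the scores:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

scores = []
for i in range(10):  # ten different random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=i)
    m = DecisionTreeClassifier().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, m.predict(X_te)))
print(sum(scores) / len(scores))  # average accuracy across the splits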

More complex project

This is probably what you came for. Let’s apply our knowledge to the field of AI diagnostics. Specifically, we’ll be working with a breast cancer data set that contains information about the smoothness, area, texture, radius, etc. of tumors.

The main aspects that change from the former project to this one are the data, how we manipulate that data, and the algorithms used.

Data and libraries

In addition to pandas, this time I used NumPy, Matplotlib, and Seaborn as the libraries, which are all imported in the same way as the previous libraries.

Since this data set was way bigger than the previous one, the data preparation process was also different. I learned how to count the rows and columns of the file (using the shape attribute), and how to count the empty values and eliminate them (with the isna and dropna methods, respectively).

df.shape
df.isna().sum()
df = df.dropna(axis=1)

Furthermore, this project involved visualizing data using Matplotlib. I can say that even though this was a relatively simple thing, it was also an exciting part. The following graph shows the number of benign and malignant tumors (B and M, respectively).
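For reference, the count plot itself takes only a couple of lines; this sketch assumes the label column is named 'diagnosis', as in the standard breast cancer data set:

import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x=df['diagnosis'])  # one bar for B (benign), one for M (malignant)
plt.show()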

Manipulating the data

Don’t worry: manipulating the data won’t negatively affect the results. On the contrary, we’ll be helping our model understand the data better by transforming the Bs and Ms into 1s and 0s.

As with many other tools, we’ll be importing something called LabelEncoder from sklearn.preprocessing. Just as with the previous project, our results will be y, meaning that we’ll be transforming the Bs and Ms stored in y into 0s and 1s.

The [:,1] part of the code tells us that we want to work with all the rows of the column with index 1. Once we run df.iloc[:,1], we’ll see how the values have been transformed into 1s and 0s.

from sklearn.preprocessing import LabelEncoder
LabelEncoder_Y = LabelEncoder()
df.iloc[:,1] = LabelEncoder_Y.fit_transform(df.iloc[:,1].values)
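If you want to double-check which letter became which number, the encoder remembers its classes in alphabetical order, so B should map to 0 and M to 1:

print(list(LabelEncoder_Y.classes_))  # ['B', 'M']: index 0 encodes B, index 1 encodes M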

In the tutorial that I watched to create this model, the guy also dives a lot into data visualization. That isn’t so relevant to me at this moment, so I’ll skip that and talk about another difference: the syntax for separating dependent and independent variables is a little different in this project.

Remember that we previously learned how to count the rows and columns in our data set? Well, I now see why this is useful: it tells me which columns I’ll use to train and test my model.

For the code below, X takes all the rows and the columns from index 2 up to (but not including) index 31, and y takes all the rows of the column with index 1.

X = df.iloc[:,2:31].values
y = df.iloc[:,1].values

More models

After that, I was gladly surprised that there was not much more complexity. The way we split training and testing data is the same, and the syntax for having a fit model isn’t very different.

As for using other models, I discovered that it isn’t as complex as I thought. Of course, there may be some values, like the random state, that I don’t fully understand yet, but other than that, you just need to import the right libraries and fit your variables.

#1 Logistic regression
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(random_state=0)
log.fit(X_train, y_train)
#2 Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X_train, y_train)
#3 Random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
forest.fit(X_train, y_train)

Calculating accuracy scores and making predictions was very similar as well, and to my surprise (again), the algorithm with the highest accuracy score was the Decision Tree one.
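A compact way to run that comparison, assuming the same X_test and y_test from our split:

from sklearn.metrics import accuracy_score
for name, clf in [('Logistic regression', log), ('Decision tree', tree), ('Random forest', forest)]:
    print(name, accuracy_score(y_test, clf.predict(X_test)))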

In the end, the comparison of our model’s predictions versus the actual values should look like this:
If you have a good eye, you’ll notice that our model isn’t 100% accurate. It’s just good.

Just good isn’t what we want when talking about health. But let’s also remember that the next step would be tuning the model. That’s something that I’m still figuring out.

Bye & build

That’s it for this article. I definitely learned a bunch of new things, and I hope that I’ve also accomplished my mission of showing others how anyone can work on ML projects.

My conclusion? It’s learnable. I’m not gonna say that Machine Learning is the easiest thing in the world, nor that it is the hardest. It’s just one of those things that you can learn if you’re interested and committed enough.

I still believe that you don’t need to know everything about math, programming, or computer science to be able to build things like these, but I’ll let you know what I think in a few more weeks, once I’ve built some more projects. For now, I’ll just say…

IT’S TIME TO BUILD!

Hey! I’m S🧠FIA, an ambitious teenager building innovative projects with 🧬Synthetic Biology and Artificial Intelligence.
Just for growth, I also innovate at TKS🦄, create content, play the piano, read a lot, and 🌎 connect with new people every week (hit me up!).
