Data Analysis: Rule-Based Classification

Creating Personas from Existing Customer Profiles and Classifying According to Customer Purchasing

Alparslan Mesri
6 min read · Nov 21, 2022
Photo by Lukas on Pexels

This article was written by Alparslan Mesri and Cem ÖZÇELİK.

Organizations spend a great deal of money to provide the best service to their existing customers. Knowing customer profiles and spending habits well allows these expenditures to be directed where they matter most. How much profit a customer with a given profile brings in may depend on several variables in that profile. For example, depending on the organization's sphere of influence, the customer's nationality, age, and gender, the type of device they shop with, the time and season in which the purchase takes place, and how often the customer buys from the organization can all be useful variables when building a customer profile. Organizations today can draw on various methods to create these profiles.

From the perspective of data science and machine learning, various segmentation and clustering algorithms can be tailored to this and similar classification tasks. However, although clustering algorithms such as K-Means may sound cooler, the data set at hand does not always need such complex algorithms. For smaller, compact datasets with less variability and uncertainty, we can also make rule-based classifications using certain features of our customers. This approach not only saves our team a great deal of work but also yields a more general and more easily accepted solution to the business problem.

In this study, our example business problem is as follows: an international gaming company wants to create new level-based customer definitions (personas) from some features of its customers, form segments from these personas, and estimate how much a new customer can be expected to bring in on average according to these segments. For example, the company wants to determine how much a 26-year-old Android user in Turkey can be expected to bring in.

Now we can get to work. First, let’s get to know our data set:

Our data set consists of 5 variables:

  • ‘PRICE’,
  • ‘SOURCE’,
  • ‘SEX’,
  • ‘COUNTRY’,
  • ‘AGE’.

Let’s also see what these variables mean:

  • PRICE: The amount of expenditure of the customer.
  • SOURCE: Customer’s operating system.
  • SEX: Gender information of the customer.
  • COUNTRY: Country/nationality information of the customer.
  • AGE: Age information of the customer.

First, let’s import the libraries that we will use in the study.

Next, let’s import our dataset and look at its summary.
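The import and loading steps are not reproduced in this text, so here is a minimal sketch of what they might look like. It assumes pandas and, because the original file is not available here, builds a small made-up sample with the same five columns. The values and the lowercase country codes are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real dataset (all values are made up).
# With the real file you would typically load it with pd.read_csv instead.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 9, 19, 59],
    "SOURCE":  ["android", "android", "ios", "android",
                "ios", "ios", "android", "android"],
    "SEX":     ["male", "male", "female", "male",
                "female", "female", "male", "female"],
    "COUNTRY": ["bra", "bra", "usa", "tur", "usa", "fra", "tur", "deu"],
    "AGE":     [17, 17, 25, 26, 33, 39, 45, 60],
})

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```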

Dataset Overview

Now let’s examine the data set in general terms:
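A general overview like this is often wrapped in a small helper function. The sketch below is one possible version, not necessarily the one used for the image that follows. The name `check_df` and the toy data are assumptions.

```python
import pandas as pd

def check_df(dataframe: pd.DataFrame) -> dict:
    """Collect a quick overview of a DataFrame (hypothetical helper)."""
    summary = {
        "shape": dataframe.shape,
        "dtypes": dataframe.dtypes.astype(str).to_dict(),
        "missing": dataframe.isnull().sum().to_dict(),
        # describe() covers the numeric columns only (PRICE and AGE here)
        "describe": dataframe.describe().T,
    }
    for name, value in summary.items():
        print(f"----- {name} -----")
        print(value)
    return summary

# Toy sample with made-up values.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 9, 19, 59],
    "SOURCE":  ["android", "android", "ios", "android",
                "ios", "ios", "android", "android"],
    "SEX":     ["male", "male", "female", "male",
                "female", "female", "male", "female"],
    "COUNTRY": ["bra", "bra", "usa", "tur", "usa", "fra", "tur", "deu"],
    "AGE":     [17, 17, 25, 26, 33, 39, 45, 60],
})

overview = check_df(df)
```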

We can see the output of this function block in the following image:

Descriptive Statistics of Dataset
Histogram Charts of PRICE & AGE Features

When we look at the data set, we see that there are no empty rows. Of course, we should not forget that this is a case study. In real cases, data is rarely this clean.

We have two numeric variables (AGE and PRICE). Looking at their statistics, PRICE ranges from 9 to 59 units with a mean of 34 units, and the AGE of the users ranges from 15 to 66. Compared with PRICE, the AGE distribution is right-skewed: the users are mostly young people. Since the data belongs to a mobile gaming company, this is expected.

We have some questions to ask the data:

1- How many sales have been generated for each SOURCE (operating system)?
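A count per category can be sketched with `value_counts`. The toy data below is made up, so the numbers it prints are not the article's results.

```python
import pandas as pd

# Toy sample with made-up values; one row is one sale.
df = pd.DataFrame({
    "PRICE":  [39, 39, 49, 29, 49, 9, 19, 59],
    "SOURCE": ["android", "android", "ios", "android",
               "ios", "ios", "android", "android"],
})

# Number of rows (sales) per operating system.
sales_per_source = df["SOURCE"].value_counts()
print(sales_per_source)
```

The same call on another column, such as PRICE or COUNTRY, counts the sales per price point or per country.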

2- How many sales are there for each PRICE value?

As we can see from the output, the prices of the products we offer to our customers are 9, 19, 29, 39, 49, and 59.

3- How many sales are there for each COUNTRY?

In the output, we can see that sales are concentrated in the USA and Brazil. Among the European countries, Germany and Turkey have similar numbers, while France has the fewest sales. Worldwide, the fewest sales come from Canada.

4- How much was earned in total from sales by COUNTRY?
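Totals per group are a natural fit for `groupby` followed by `sum`. This is a sketch on made-up data, so the ranking it prints does not reflect the real dataset.

```python
import pandas as pd

# Toy sample with made-up values.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 9, 19, 59],
    "SOURCE":  ["android", "android", "ios", "android",
                "ios", "ios", "android", "android"],
    "COUNTRY": ["bra", "bra", "usa", "tur", "usa", "fra", "tur", "deu"],
})

# Total earnings per country, largest first.
total_by_country = (
    df.groupby("COUNTRY")["PRICE"].sum().sort_values(ascending=False)
)
print(total_by_country)
```

Swapping the grouping key for "SOURCE" gives the totals per operating system in the same way.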

When the earnings from customers are analyzed by country, we see that the highest earnings come from US users, as expected.

5- How much was earned in total from sales by SOURCE types?

As we can deduce from these results, total earnings from Android users are ~47% higher than from iOS users.

6- What are the PRICE averages by COUNTRY?
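Averages follow the same pattern, with `mean` as the aggregation. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy sample with made-up values.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 9, 19, 59],
    "COUNTRY": ["bra", "bra", "usa", "tur", "usa", "fra", "tur", "deu"],
})

# Average PRICE per country, returned as a one-column DataFrame.
avg_by_country = df.groupby("COUNTRY").agg({"PRICE": "mean"})
print(avg_by_country)
```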

Turkey and Brazil are at the top of the list, followed by Germany and the USA.

7- What are the PRICE averages by SOURCE?

The numbers are nearly the same, but Android users pay slightly more than iOS users.

8- What are the PRICE averages by COUNTRY-SOURCE?
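Grouping by two columns at once gives one average per COUNTRY-SOURCE pair, and `idxmax` / `idxmin` then point to the highest and lowest groups. This is a sketch on made-up data, so the groups it reports are not the article's findings.

```python
import pandas as pd

# Toy sample with made-up values.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 9, 19, 59],
    "SOURCE":  ["android", "android", "ios", "android",
                "ios", "ios", "android", "android"],
    "COUNTRY": ["bra", "bra", "usa", "tur", "usa", "fra", "tur", "deu"],
})

# One average per (COUNTRY, SOURCE) pair; the result has a two-level index.
avg_by_pair = df.groupby(["COUNTRY", "SOURCE"])["PRICE"].mean()
print(avg_by_pair)

highest_group = avg_by_pair.idxmax()   # (country, source) with the largest mean
lowest_group = avg_by_pair.idxmin()    # (country, source) with the smallest mean
print(highest_group, lowest_group)
```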

According to the output, the highest expenditure comes from the Turkey-Android customer group, while the lowest comes from the France-iOS group.

9- What are the average earnings by COUNTRY-SOURCE-SEX-AGE?
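The four-way breakdown can be sketched the same way, sorted so that the top-spending groups appear first and flattened back into regular columns with `reset_index`. The name `agg_df` and the data are assumptions for illustration.

```python
import pandas as pd

# Toy sample with made-up values.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 9, 19, 59],
    "SOURCE":  ["android", "android", "ios", "android",
                "ios", "ios", "android", "android"],
    "SEX":     ["male", "male", "female", "male",
                "female", "female", "male", "female"],
    "COUNTRY": ["bra", "bra", "usa", "tur", "usa", "fra", "tur", "deu"],
    "AGE":     [17, 17, 25, 26, 33, 39, 45, 60],
})

agg_df = (
    df.groupby(["COUNTRY", "SOURCE", "SEX", "AGE"])["PRICE"]
      .mean()
      .sort_values(ascending=False)   # highest average first
      .reset_index()                  # turn the group keys back into columns
)
print(agg_df.head())
```

Turning the index back into columns keeps the group keys available for the later steps, which build new variables from them.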

Now let us examine customer groups in more detail. 46-year-old male Android users in Brazil seem to spend the most, along with some American and French customer groups.

Young female Android users in France are also a notable group in terms of spending.

So far, we have asked the data some questions and got answers. As a next step, let's categorize the data by age.
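Age brackets can be created with `pd.cut`. The bin edges below are an assumption: only the 24_30 bracket appears in this article, and the other edges are picked for illustration.

```python
import pandas as pd

# Toy ages (made up).
df = pd.DataFrame({"AGE": [17, 17, 25, 26, 33, 39, 45, 60]})

# Assumed bracket edges; each interval is open on the left and closed on the
# right, e.g. (23, 30] is labelled "24_30".
bins = [0, 18, 23, 30, 40, 70]
labels = ["0_18", "19_23", "24_30", "31_40", "41_70"]

df["AGE_CAT"] = pd.cut(df["AGE"], bins=bins, labels=labels)
print(df)
```

Keeping the labels in the `lower_upper` form makes them easy to reuse later as part of a persona name.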

We concatenate variables to create a level-based customer definition variable.

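The point we need to pay attention to here is that, after creating the customers_level_based values with a list comprehension, these values need to be made unique. Because several ages fall into the same age bracket, the same persona can appear in more than one row. For example, there cannot be more than one of the following: BRA_ANDROID_MALE_24_30.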
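The concatenation and the deduplication can be sketched as follows. The column name `AGE_CAT` and the data are assumptions. Duplicate personas are collapsed by grouping on the new variable and averaging PRICE.

```python
import pandas as pd

# Toy aggregated table (made up). The first two rows describe the same
# persona because their ages fell into the same bracket.
agg_df = pd.DataFrame({
    "COUNTRY": ["bra", "bra", "tur", "usa"],
    "SOURCE":  ["android", "android", "android", "ios"],
    "SEX":     ["male", "male", "male", "female"],
    "AGE_CAT": ["24_30", "24_30", "24_30", "31_40"],
    "PRICE":   [40.0, 30.0, 29.0, 49.0],
})

# Build the persona name with a list comprehension.
agg_df["CUSTOMERS_LEVEL_BASED"] = [
    f"{country}_{source}_{sex}_{age_cat}".upper()
    for country, source, sex, age_cat in zip(
        agg_df["COUNTRY"], agg_df["SOURCE"], agg_df["SEX"], agg_df["AGE_CAT"]
    )
]

# Collapse duplicates: one row per persona, with the mean PRICE.
personas = agg_df.groupby("CUSTOMERS_LEVEL_BASED", as_index=False)["PRICE"].mean()
print(personas)
```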

We then look at the PRICE averages according to the CUSTOMERS_LEVEL_BASED variable we created.

We divide users into segments according to the average price of the personas we create. We name these segments A, B, C, and D.
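One way to cut the personas into four equally sized groups is `pd.qcut`, which splits on quartiles of the average PRICE. This is a sketch on made-up values, under the assumption that the segments are quartile-based. The labels run from D (lowest) to A (highest).

```python
import pandas as pd

# Toy persona table (made-up average prices).
personas = pd.DataFrame({
    "CUSTOMERS_LEVEL_BASED": [f"PERSONA_{i}" for i in range(8)],
    "PRICE": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

# Four quantile-based bins; labels are given from the lowest bin to the highest.
personas["SEGMENT"] = pd.qcut(personas["PRICE"], q=4, labels=["D", "C", "B", "A"])
print(personas)
```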

Finally, let’s examine the statistical properties of the segments we have created:
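Per-segment statistics can be sketched with a single `agg` call. The data is made up, so the numbers differ from the table below.

```python
import pandas as pd

# Toy persona table with segments already assigned (made-up values).
personas = pd.DataFrame({
    "PRICE":   [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "SEGMENT": ["D", "D", "C", "C", "B", "B", "A", "A"],
})

# Count, mean, min and max of PRICE within each segment.
segment_stats = personas.groupby("SEGMENT")["PRICE"].agg(["count", "mean", "min", "max"])
print(segment_stats)
```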

Statistics of the Segments

As we can see in this table, there are 27 users in each segment. Segment A has the highest earnings and segment D the lowest. In segment A, the highest sales value is 45.42 units, the lowest is 36.06 units, and the average is 38.69 units.

Here, as a different approach, we could combine the segments that are most similar to each other. For this problem, those are segments B and C: the average of segment B is 34.99, while the average of segment C is 33.50, and their maximum values are also close. However, combining similar segments can sometimes cause us to lose distinctive information. That’s why we don’t combine these two segments in this business problem.

Finally, we will define a few sample users and find out which segments they fall into in our system.
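Looking up a new user comes down to building the persona string and filtering the persona table. In the sketch below, the helper `find_segment`, the prices, and the segment letters are all made up, and the age brackets (31_40, 24_30) are assumptions.

```python
import pandas as pd

# Toy persona table (made-up prices and segments).
personas = pd.DataFrame({
    "CUSTOMERS_LEVEL_BASED": [
        "TUR_ANDROID_MALE_31_40",
        "FRA_IOS_FEMALE_31_40",
        "TUR_ANDROID_MALE_24_30",
    ],
    "PRICE":   [41.0, 32.0, 36.0],
    "SEGMENT": ["A", "C", "B"],
})

def find_segment(table: pd.DataFrame, persona: str) -> pd.DataFrame:
    """Return the row(s) matching a persona; empty if it was never seen."""
    return table[table["CUSTOMERS_LEVEL_BASED"] == persona]

new_users = [
    "TUR_ANDROID_MALE_31_40",   # e.g. a 33-year-old male Android user in Turkey
    "FRA_IOS_FEMALE_31_40",     # e.g. a 39-year-old female iOS user in France
    "TUR_ANDROID_MALE_24_30",   # e.g. a 26-year-old male Android user in Turkey
]

for user in new_users:
    print(find_segment(personas, user).to_string(index=False))
```

A persona that never occurred in the data returns an empty result, which is worth handling explicitly in a real system.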

The output of this for loop is:

In the first example, we segmented a 33-year-old male Android user in Turkey; in the second, a 39-year-old female iOS user in France; and in the last, a 26-year-old male Android user in Turkey.

We have come to the end of our study. We performed a rule-based classification on a data set with low uncertainty, without the need for complex algorithms.
