In the middle of the current AI revolution, with new high-performing algorithms appearing every day, it’s easy to get lost. For those who work with customer data and real use cases, a more sophisticated algorithm doesn’t necessarily translate into more value for clients. While generative AI is at the cutting edge right now, today we will focus on an evergreen topic that still retains strategic value: clustering algorithms.
What is clustering?
Clustering algorithms are techniques that group data points into clusters based on their similarity. Each cluster contains items that are more similar to each other than to those in other clusters. Clustering is used to discover natural groupings within data, without prior knowledge of the group definitions. It belongs to the broader field of unsupervised learning, that is, algorithms that learn patterns from unlabeled data.
Two of the most common clustering approaches are distance-based and density-based clustering. Distance-based clustering minimizes the distance between data points and the center of their cluster (known as the “centroid”); its best-known example is K-Means. Density-based clustering defines clusters as areas with a high density of data points; its most commonly used technique is DBSCAN.
To understand how clustering can be implemented successfully with customer data, let’s see a practical example.
Fig. 1 - Visual representation of different clustering techniques
Customer segmentation with clustering techniques
For instance, a retail company could use K-Means clustering to segment its customers based on purchasing behavior and demographics. This segmentation can reveal patterns such as high-value customers who make frequent purchases or budget-conscious customers who only buy items on sale. The company can then craft a personalized communication strategy aimed at each segment, potentially increasing customer engagement and sales. A luxury segment might receive promotions for exclusive products, while budget-conscious segments could be targeted with discount offers and loyalty rewards.
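As a minimal sketch of the segmentation described above, the snippet below runs K-Means with scikit-learn on synthetic data. The two features (orders per year and average basket value), the three generated groups, and the choice of three clusters are all assumptions made for illustration, not real customer data or a recommended setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, not real customer data).
# Columns: [orders per year, average basket value].
rng = np.random.default_rng(42)
frequent_high_value = rng.normal(loc=[40.0, 120.0], scale=[4.0, 10.0], size=(50, 2))
budget_conscious = rng.normal(loc=[8.0, 25.0], scale=[2.0, 5.0], size=(50, 2))
occasional = rng.normal(loc=[3.0, 90.0], scale=[1.0, 10.0], size=(50, 2))
customers = np.vstack([frequent_high_value, budget_conscious, occasional])

# K-Means is distance-based, so put both features on a comparable scale first.
scaled = StandardScaler().fit_transform(customers)

# n_clusters=3 only because three groups were generated above;
# on real data this value has to be estimated.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
segment = model.fit_predict(scaled)

# Describe each segment by its average behaviour in the original units.
for k in range(3):
    members = customers[segment == k]
    print(
        f"segment {k}: {len(members)} customers, "
        f"avg orders/year={members[:, 0].mean():.1f}, "
        f"avg basket={members[:, 1].mean():.1f}"
    )
```

The cluster ids themselves carry no meaning; it is the per-segment averages that let a marketer decide which cluster is the "high-value" or "budget-conscious" one.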
Similarly, DBSCAN can be leveraged to identify more nuanced, less obvious segments by focusing on the density of data points. This is particularly useful for discovering niche markets or subgroups within larger segments that exhibit unique behaviors or preferences. For example, within a broad customer segment identified by K-Means, DBSCAN might uncover a dense subgroup of customers who are particularly loyal to a specific product line or brand. This subgroup could represent an opportunity for targeted product development or loyalty programs.
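The idea of a dense subgroup hiding inside a broad segment can be sketched with scikit-learn's DBSCAN. The data below is synthetic, and the feature names, `eps`, and `min_samples` values are assumptions picked for this toy example; on real data they need tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, not real customer data).
# Columns: [orders per year, share of spend on one product line].
rng = np.random.default_rng(7)
broad = rng.uniform(low=[0.0, 0.0], high=[50.0, 1.0], size=(200, 2))
niche = rng.normal(loc=[30.0, 0.9], scale=[0.5, 0.01], size=(40, 2))
customers = np.vstack([broad, niche])

scaled = StandardScaler().fit_transform(customers)

# eps is the neighbourhood radius, min_samples the number of neighbours
# (the point itself included) required for a point to count as "dense".
labels = DBSCAN(eps=0.15, min_samples=10).fit_predict(scaled)

# DBSCAN labels points that belong to no dense region as -1 (noise).
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print(f"dense subgroups found: {n_clusters}, customers left unassigned: {n_noise}")
```

Unlike K-Means, DBSCAN does not need the number of clusters up front, and it leaves sparse points unassigned instead of forcing them into a segment.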
Are clustering techniques easy to implement?
Not necessarily.
In fact, implementing clustering algorithms can bring different kinds of challenges, ranging from the selection of an appropriate algorithm tailored to a specific dataset's peculiarities to ensuring the efficiency of the overall clustering process.
One of the first and most important choices to make is determining the optimal number of clusters. Too few clusters can oversimplify a complex dataset, making it impossible to infer underlying patterns and insights, whereas too many can overcomplicate it, diluting the insights extracted and possibly leading to misinterpretation of relationships in the data. Finding the optimal number of clusters is vital not just for customer segmentation, but also for tasks such as refining product recommendation systems. Accurate clustering allows for the identification of nuanced user groups with specific interests, enabling more precise and personalized product recommendations.
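One common way to estimate the number of clusters is to fit K-Means for several candidate values and compare a quality score such as the silhouette score. The sketch below assumes synthetic data with four well-separated groups and a candidate range of 2 to 8; both are illustrative choices, and other criteria (for example the elbow method) are equally valid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption): four well-separated groups in two dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
data = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2)) for c in centres])

# Fit K-Means for each candidate k and record the mean silhouette score,
# which ranges from -1 to 1; higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"highest silhouette at k={best_k}")
```

On real customer data the curve is rarely this clear-cut, so the score is best treated as one input next to business constraints such as how many segments a marketing team can realistically act on.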
Let’s look at some well-known examples. Algorithms like K-Means are known for their efficiency on large datasets with clearly separable clusters. However, their performance degrades with unusual cluster shapes or in datasets heavily populated with outliers. Density-based approaches such as DBSCAN, on the other hand, are good at identifying irregularly shaped clusters and at handling noise, but struggle when clusters have widely varying densities, because a single density threshold is applied to the whole dataset. In datasets with a large number of outliers, DBSCAN’s ability to discern complex patterns can help identify subtle customer trends that K-Means might miss.
The handling of outliers and noise is also indispensable for the robust performance of clustering algorithms. Certain clustering algorithms inherently possess a greater resilience against outliers, further highlighting the importance of strategic algorithm selection. By adopting strategies to address outliers and noise, data practitioners can significantly enhance the accuracy and reliability of clustering results. These strategies can often involve pre-processing data more effectively to deal with those outliers that can skew the results.
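One simple pre-processing strategy is to flag extreme rows before clustering, for example with interquartile-range (IQR) fences. The sketch below uses NumPy; the helper name `iqr_mask`, the 1.5 factor, and the sample values are assumptions for illustration, and other approaches (robust scaling, winsorizing) may suit a given dataset better.

```python
import numpy as np


def iqr_mask(values: np.ndarray, factor: float = 1.5) -> np.ndarray:
    """Return True for rows that lie inside the IQR fences on every column."""
    q1 = np.percentile(values, 25, axis=0)
    q3 = np.percentile(values, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return np.all((values >= lower) & (values <= upper), axis=1)


# Hypothetical sample. Columns: [orders per year, total spend].
customers = np.array([
    [5, 200.0], [6, 220.0], [4, 180.0], [7, 260.0], [5, 210.0],
    [6, 240.0], [5, 190.0], [4, 170.0],
    [300, 90000.0],  # e.g. a bulk reseller account
])

keep = iqr_mask(customers)
cleaned = customers[keep]      # rows passed on to the clustering step
flagged = customers[~keep]     # rows set aside for separate review
print(f"kept {len(cleaned)} rows, flagged {len(flagged)} as outliers")
```

Flagged rows are set aside here, not deleted: an extreme account can be a valuable segment in its own right, so it is usually worth reviewing before it is excluded.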
Clustering at Human37
At Human37, we know that a solid customer data strategy starts with detailed work on data quality and observability, so that the data feeding each technique is fit for our clients’ needs.
Setting up a clear pipeline for this is pivotal. To illustrate how clustering can be implemented, a client’s raw data from different sources is typically loaded into a data warehouse through ingestion and orchestration tools (such as Fivetran, Airbyte, or Airflow). Once the data is in a cloud-based data warehouse, such as Snowflake or BigQuery, it needs to be transformed with technologies such as dbt or Dataform, so that tables are prepared for the use case at hand.

To take a concrete example from our past client work, we prepared an enriched users table in which each row corresponds to a user, with a variety of fields describing their customer journey. The importance of a clean, high-quality table can’t be stressed enough: assertions to check for duplicates, sensible partitioning to increase efficiency, and QUALIFY clauses to remove redundant rows are all essential.

Then, clustering techniques such as the ones described above can be applied, either by scheduling Python scripts to run in the pipeline or through the built-in functions that some data warehouses offer, as in the case of BigQuery ML. This opens up new possibilities for finding unnoticed patterns in the data, leading to more insightful and detailed dashboards that help managers understand their customers better. For example, once the clustering algorithm has run, the results can be used to segment customers and label them through a new field in the enriched users table, so that differences in their behavior can be analyzed and visualized in dashboarding tools such as Looker Studio or Power BI. It doesn’t stop here.
Thanks to Reverse ETL technologies like Hightouch, these new segments can be pushed into Customer Communication Platforms like Iterable to achieve greater personalization in communication campaigns across different channels.

To illustrate, consider an e-commerce business that wants to understand its customer base better in order to market more effectively. After setting up a data pipeline to ingest and transform its data, we can apply clustering techniques to segment customers based on behaviors and attributes such as purchase history, website interactions, and demographic information. The impact can be significant. With distinct customer clusters identified, marketing efforts can be tailored more precisely: for example, the business could target high-value customers with exclusive offers and personalized recommendations, while reaching less active customers with re-engagement campaigns. This can lead to more efficient marketing spend and improved customer engagement. Reverse ETL then delivers these segments to the business’s existing Customer Communication Platforms, so the personalization carries through to every channel.
Fig. 2 - Visualization of the various steps of the pipeline
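Two steps of the pipeline described above, deduplicating the enriched users table and writing the cluster result back as a segment label, can be sketched in pandas. The column names, sample rows, cluster ids, and segment names are all hypothetical; in a warehouse the deduplication would more likely live in SQL, in a dbt or Dataform model.

```python
import pandas as pd

# Hypothetical enriched users table containing a duplicated user (u2).
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u2", "u3"],
    "updated_at": pd.to_datetime(
        ["2024-01-05", "2024-01-02", "2024-01-09", "2024-01-03"]
    ),
    "orders_per_year": [40, 8, 9, 3],
})

# Keep only the most recent row per user. This mirrors a warehouse-side
# QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) = 1.
latest = (
    users.sort_values("updated_at")
    .drop_duplicates(subset="user_id", keep="last")
    .sort_values("user_id")
    .reset_index(drop=True)
)

# Data-quality assertion: exactly one row per user before clustering.
assert latest["user_id"].is_unique

# Hypothetical output of a clustering step, keyed by user_id.
cluster_ids = pd.Series({"u1": 0, "u2": 1, "u3": 1})
segment_names = {0: "high_value", 1: "budget_conscious"}

# Store the result as a new field on the enriched users table.
latest["segment"] = latest["user_id"].map(cluster_ids).map(segment_names)
print(latest[["user_id", "orders_per_year", "segment"]])
```

A table shaped like this, with one row per user and a readable segment field, is what a dashboarding tool or a Reverse ETL sync would then read from.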
This strategy is not just about data analysis; it's about delivering actionable insights that drive meaningful business outcomes. Our clients have been able to study, at a much finer granularity, the patterns and needs of customer segments they had not previously considered, and this has led to new, better-targeted activation campaigns. Our work in clustering analysis shows how solutions can be tailored to the specific needs of each client, ultimately leading to improved customer engagement and increased business value.
Interested? Feel free to reach out!