What Is Unsupervised Learning in Data Science?
Unsupervised learning, also known as unsupervised machine learning, is a machine learning approach that employs algorithms to analyze and cluster unlabeled datasets. These algorithms are designed to identify hidden patterns and group data without requiring human intervention.
This capacity to unveil both similarities and disparities within data makes unsupervised learning a versatile solution for a range of applications, including exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
Common Methods in Unsupervised Learning
Unsupervised learning encompasses a diverse set of techniques that fall into three primary categories: clustering, association, and dimensionality reduction. In the following sections, we’ll delve into each of these approaches and outline popular algorithms and strategies for their effective implementation.
Clustering
Clustering is a data mining technique that organizes unlabeled data based on similarities or dissimilarities. Clustering algorithms transform raw, unclassified data into groups characterized by the structures or patterns inherent in the data. These algorithms can be broadly categorized into several types, including exclusive, overlapping, hierarchical, and probabilistic.
Exclusive and Overlapping Clustering
Exclusive clustering is a grouping concept where a data point belongs to a single cluster exclusively, often referred to as “hard” clustering. An example is the K-means clustering algorithm, a prevalent technique for distributing data points into clusters based on their proximity to cluster centroids. K-means finds applications in market segmentation, document clustering, image segmentation, and image compression.
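The alternating assignment and update steps at the heart of K-means can be sketched in a few lines. This is a minimal NumPy sketch on made-up 2-D data with a naive initialization; a real project would typically use a library implementation such as scikit-learn's, which adds smarter initialization and convergence checks.

```python
import numpy as np

# Toy 2-D points forming two well-separated blobs (values chosen by hand).
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])

def kmeans(X, k, n_iter=10):
    centroids = X[:k].copy()  # naive init: first k points (k-means++ is better)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid ("hard" clustering).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(X, k=2)
```

On this toy data the two blobs end up in separate clusters, with each centroid at its blob's mean.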
Overlapping clustering departs from exclusive clustering by allowing data points to belong to multiple clusters, each with a varying degree of membership. An example is “soft” or fuzzy K-means clustering, which assigns every data point a membership weight for each cluster.
Hierarchical Clustering
Hierarchical clustering is an unsupervised approach that comes in two primary modes: agglomerative and divisive. Agglomerative clustering is a “bottom-up” approach that starts with each data point as its own cluster and progressively merges the closest clusters until a single cluster remains. Similarity between clusters can be measured in several ways, including Ward’s linkage, average linkage, complete linkage, and single linkage.
Divisive clustering, in contrast, follows a “top-down” approach, progressively dividing a single data cluster based on dissimilarities between data points. These clustering processes are often visualized using dendrograms.
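The agglomerative "bottom-up" merge loop can be sketched directly. Below is a toy single-linkage version in plain Python/NumPy on hand-picked 1-D values; libraries such as SciPy provide efficient implementations of all the linkage methods plus dendrogram plotting.

```python
import numpy as np

# Hand-picked 1-D values forming two obvious groups.
X = np.array([[0.0], [0.3], [0.5], [9.0], [9.4], [9.7]])

# Start with every point in its own cluster ("bottom-up").
clusters = [[i] for i in range(len(X))]

def single_linkage(a, b):
    # Cluster distance = distance between the closest pair of points.
    return min(abs(X[i, 0] - X[j, 0]) for i in a for j in b)

# Repeatedly merge the two closest clusters until two remain.
while len(clusters) > 2:
    pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    _, i, j = min(pairs)
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
```

Recording the order and distance of each merge is exactly the information a dendrogram visualizes.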
Probabilistic Clustering
Probabilistic clustering is an unsupervised approach used for density estimation and soft clustering tasks. Within this framework, data points are grouped based on their likelihood of belonging to particular probability distributions. A widely used technique is the Gaussian Mixture Model (GMM), which determines the Gaussian (normal) distribution a given data point is most likely to have come from.
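GMMs are typically fitted with the expectation-maximization (EM) algorithm: the E-step computes each component's soft "responsibility" for each point, and the M-step re-estimates the component parameters from those soft assignments. A minimal 1-D, two-component sketch (the data values and initial guesses are made up):

```python
import numpy as np

# Toy 1-D data clustered near two modes (values chosen by hand).
x = np.array([0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1])

# Initial guesses for the two Gaussian components.
mu = np.array([0.5, 4.5])   # means
var = np.array([1.0, 1.0])  # variances
w = np.array([0.5, 0.5])    # mixing weights

for _ in range(20):
    # E-step: each component's "responsibility" for each point.
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances from soft assignments.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    w = nk / len(x)

labels = resp.argmax(axis=1)  # hard assignment, if one is needed
```

Unlike K-means, the responsibilities are fractional, which is what makes this a soft clustering method.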
These unsupervised learning methods are vital for data pre-processing, exploratory data analysis, and revealing hidden knowledge that informs decision-making and fuels innovation.
Association Rules
An association rule is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand the relationships between different products. Understanding customers’ consumption habits enables businesses to develop better cross-selling strategies and recommendation engines; examples include Amazon’s “Customers Who Bought This Item Also Bought” and Spotify’s “Discover Weekly” playlist. While several algorithms are used to generate association rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm is the most widely used.
Apriori algorithms have been popularized through market basket analysis, leading to different recommendation engines for music platforms and online retailers. They are used within transactional datasets to identify frequent itemsets, or collections of items, in order to estimate the likelihood of consuming one product given the consumption of another. For example, if I play Black Sabbath’s radio on Spotify, starting with their song “Orchid”, one of the other songs on this channel will likely be a Led Zeppelin song, such as “Over the Hills and Far Away.” This is based on my prior listening habits as well as those of others. The Apriori algorithm uses a hash tree to count itemsets, navigating the dataset in a breadth-first manner.
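The core of the Apriori idea, counting an itemset's support and keeping only itemsets whose subsets are all frequent, can be sketched on a made-up basket dataset. The item names and support threshold below are illustrative; a full implementation would generate candidates level by level and use a hash tree to speed up the counting.

```python
from itertools import combinations

# Hypothetical shopping baskets (item names are made up).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 3  # an itemset must appear in at least 3 baskets

def support(itemset):
    """Number of baskets containing every item in the itemset."""
    return sum(itemset <= basket for basket in transactions)

items = sorted({item for basket in transactions for item in basket})
# Level 1: frequent single items.
frequent_1 = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
# Level 2: the Apriori principle prunes any pair containing an infrequent item.
frequent_2 = {frozenset(pair) for pair in combinations(items, 2)
              if all(frozenset([i]) in frequent_1 for i in pair)
              and support(frozenset(pair)) >= min_support}

# Rule confidence, e.g. an estimate of P(milk | bread):
conf_bread_milk = support(frozenset({"bread", "milk"})) / support(frozenset({"bread"}))
```

A rule like "bread → milk" is then reported when both its support and its confidence clear chosen thresholds.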
Dimensionality Reduction
While more data generally yields more accurate results, it can also slow down machine learning algorithms and make datasets hard to visualize. Dimensionality reduction is a technique used when the number of features, or dimensions, in a given dataset is too high. It reduces the number of data inputs to a manageable size while preserving the integrity of the dataset as much as possible. It is commonly used in the data preprocessing stage, and several dimensionality reduction methods can be used, such as:
Principal component analysis
Principal component analysis (PCA) is a dimensionality reduction algorithm used to reduce redundancies and compress datasets through feature extraction. The method applies a linear transformation to create a new representation of the data, yielding a set of “principal components.” The first principal component is the direction that maximizes the variance of the dataset. The second principal component also captures as much variance as possible, but it must be completely uncorrelated with the first, yielding a direction that is perpendicular, or orthogonal, to it. This process repeats for as many dimensions as the data has, each new principal component being the direction of greatest remaining variance orthogonal to all prior components.
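PCA's linear transformation can be sketched via an eigendecomposition of the sample covariance matrix. The synthetic 2-D data below varies mostly along one direction, so the first component should capture nearly all of the variance (the data-generating choices are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data that varies mostly along one direction.
t = rng.normal(size=100)
X = np.column_stack([t, 0.5 * t + 0.05 * rng.normal(size=100)])

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues ascending
components = eigvecs[:, ::-1]           # principal components, by decreasing variance
explained = eigvals[::-1] / eigvals.sum()

scores = Xc @ components                # the data expressed in the new basis
```

Dropping the columns of `scores` with the smallest explained variance gives the compressed representation.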
Singular value decomposition
Singular value decomposition (SVD) is another dimensionality reduction approach that factorizes a matrix A into three matrices, written A = USVᵀ, where U and V are orthogonal matrices and S is a diagonal matrix whose entries are the singular values of A. Like PCA, it is commonly used to reduce noise and compress data, such as image files; truncating the factorization to the largest singular values gives a low-rank approximation of A.
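NumPy computes the factorization directly, and a rank-1 truncation of the same matrix illustrates the compression use case (the matrix values here are arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Factorization A = U S V^T (S is returned as a vector of singular values,
# sorted in descending order).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt   # multiplying back recovers A

# Rank-1 truncation: keep only the largest singular value (lossy compression).
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
```

For an image stored as a matrix, keeping only the top few singular values in this way is a classic compression demo.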
Autoencoders
Autoencoders leverage neural networks to compress data and then reconstruct a new representation of the original input. The hidden layer acts as a bottleneck that compresses the input layer before reconstruction takes place in the output layer. The stage from the input layer to the hidden layer is referred to as “encoding,” while the stage from the hidden layer to the output layer is known as “decoding.”
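For a purely linear autoencoder with squared-error loss, the optimal bottleneck weights are known to coincide with the top principal directions, which lets the encode/decode structure be sketched without a training loop. The synthetic data below genuinely lives on a 2-D subspace, so a width-2 bottleneck loses nothing; a practical autoencoder would use nonlinear layers trained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 4-D inputs that actually lie on a 2-D subspace,
# so a width-2 bottleneck can represent them losslessly.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 4))

# Read the "trained" linear weights off an SVD: the top-2 right singular
# vectors span the subspace the data lives in.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_enc = Vt[:2].T   # (4, 2): input layer -> hidden bottleneck ("encoding")
W_dec = Vt[:2]     # (2, 4): bottleneck -> output layer ("decoding")

code = X @ W_enc       # compressed 2-D representation
X_hat = code @ W_dec   # reconstruction of the original 4-D input
```

The interesting quantity is the reconstruction error between `X_hat` and `X`; minimizing it is exactly what training an autoencoder does.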
Applications of Unsupervised Learning
Unsupervised learning finds applications in real-world scenarios, offering an exploratory approach to gain insights from data and swiftly identify patterns. Here are some common applications:
1. News Segmentation: Platforms like Google News use unsupervised learning to categorize articles related to the same news story from diverse online sources.
2. Computer Vision: Unsupervised learning algorithms play a crucial role in visual perception tasks, including object recognition.
3. Medical Imaging: Unsupervised machine learning enhances medical imaging devices used in radiology and pathology, aiding in accurate diagnoses through image detection, classification, and segmentation.
4. Anomaly Detection: Unsupervised learning excels in sifting through extensive datasets to identify atypical data points, useful for detecting faults, errors, or security breaches.
5. Customer Personas: Unsupervised learning empowers organizations to construct comprehensive buyer persona profiles, helping understand shared characteristics and purchasing behaviours among business clients.
6. Recommendation Engines: By uncovering data trends from historical purchase data, unsupervised learning facilitates the development of more effective cross-selling strategies, especially valuable for online retailers.
Unsupervised vs. Supervised vs. Semi-Supervised Learning
Unsupervised learning works with unlabeled data, discovering patterns without predefined categories or outputs.
Supervised learning uses labeled data to make predictions or classifications based on learned patterns.
Semi-supervised learning leverages a mix of labeled and unlabeled data to enhance model performance.
While unsupervised learning uncovers hidden insights within unstructured datasets, it may not yield explicit predictions. Its importance in data pre-processing, exploratory data analysis, and innovation across various industries remains significant.