Data Stream Clustering: A high-throughput, continuous data clustering solution

Project Description

Market Need

High-volume, high-throughput data streams are common in many industries, including financial services (transaction streams), communications (instant messaging, SMS, micro-blogging), web and gaming (action and event streams) and production line environments (machine-generated data). The ability to analyse and gain insights from this type of data as events happen, in real time, can be hugely beneficial. Traditionally, data analytics is performed as an off-line batch process, with results available hours or even days after the data was produced. Any action taken on these insights therefore lags considerably behind the original events, and in many scenarios reducing this response delay by analysing the live data stream is of critical importance. Clustering is a core data analytics technique whereby similar entities are automatically identified and grouped together. It drives many common applications of data analytics, such as detecting anomalous or fraudulent activity, identifying market segments and user behaviours, and reporting spam and emerging topics and patterns. CeADAR has developed a high-throughput, scalable clustering solution for data streams that brings real-time, ‘live data’ capabilities to these advanced data analytics tasks.

Technology Solution

CeADAR’s high-throughput, continuous clustering solution can process over 3 million entities per minute (50,000+ per second). Clusters of similar entities are automatically identified in the data stream and reported in near real time, along with associated statistics such as cluster size and growth rate. Although the system has so far been evaluated on textual data, the solution can be adapted to other content types such as transactions, images and other more complex data objects. The technology is implemented on Storm, the open source data stream processing framework used by big data companies such as Twitter, Yahoo, Groupon and Klout. Storm provides scalability and the ability to run on commodity hardware or in the cloud: as data volumes grow or shrink, servers or cloud instances can be provisioned to match requirements. The continuous clustering technology uses a parallel clustering algorithm developed by CeADAR which harnesses LSH (Locality Sensitive Hashing), a technique that hashes similar items to the same bucket with high probability and so allows the clustering task to be processed in parallel across multiple computing nodes. This allows the clustering to be applied to high-volume, high-throughput data streams.
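To illustrate the general idea behind LSH, the sketch below buckets short text messages using MinHash signatures split into bands. It is a minimal illustration under stated assumptions, not CeADAR's algorithm: the choice of MinHash, the word-shingle size, the number of hash functions and bands, and every function name here are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32              # assumed signature length
BANDS = 8                    # assumed number of bands
ROWS = NUM_HASHES // BANDS   # signature rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenisation)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """One minimum hash value per seed; similar shingle sets agree on many positions."""
    tokens = shingles(text)
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def band_keys(signature):
    """Split a signature into bands; each (band index, slice) pair is a bucket key."""
    return [(b, signature[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]


def worker_for(key, num_workers):
    """Map a bucket key to a worker so each bucket is handled by exactly one node."""
    return _hash(0, repr(key)) % num_workers


def bucket_stream(items):
    """Group item ids by bucket key; ids sharing a bucket are candidate cluster members."""
    buckets = defaultdict(list)
    for item_id, text in items:
        for key in band_keys(minhash_signature(text)):
            buckets[key].append(item_id)
    return buckets


messages = [
    ("a", "breaking news storm hits dublin city centre tonight"),
    ("b", "breaking news storm hits dublin city centre tonight"),
    ("c", "totally unrelated message about football results and scores"),
]
buckets = bucket_stream(messages)
candidates = [ids for ids in buckets.values() if len(ids) > 1]
```

Because every item with the same bucket key is routed to the same worker, each worker can compare candidates locally without seeing the whole stream. In a Storm topology, this kind of key-based routing would typically be expressed with a fields grouping.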

Figure 1: High level overview of Storm implementation

Applicability

Clustering, the automatic identification and grouping of similar items, is a fundamental technique in many high-level data analytics tasks including:

Detecting and reporting new user behaviours and patterns
Identifying fraudulent and anomalous behaviours
Detecting spam in communications networks
Reporting emerging topics and themes in content streams
Understanding market segments and user types

There are many other domain-specific and niche tasks in which clustering can also be applied.

CeADAR’s continuous clustering technology can be applied to high-throughput, high-volume data streams and enables these analytics tasks to be carried out live on the data. Although the initial focus has been on textual data, the core solution is data agnostic and can be adapted for clustering other data types, including transactions, user actions and images.

Figure 2: Interactive animated visualization of continuous clustering of the live Twitter stream showing top 200 largest and top 200 fastest growing clusters
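A ranking like the one shown in Figure 2 could be derived from per-cluster size counts taken at two points in time. The sketch below is only an illustration: the definition of growth rate as new members per second over a window, the dictionary inputs and the function name are all assumptions rather than details of the CeADAR system.

```python
import heapq


def top_clusters(prev_sizes, curr_sizes, window_seconds, n=200):
    """Return (largest ids, fastest-growing ids, growth per cluster).

    prev_sizes / curr_sizes map cluster id -> member count at the start and
    end of a window; growth is assumed to be new members per second.
    """
    growth = {
        cid: (size - prev_sizes.get(cid, 0)) / window_seconds
        for cid, size in curr_sizes.items()
    }
    largest = heapq.nlargest(n, curr_sizes, key=curr_sizes.get)
    fastest = heapq.nlargest(n, growth, key=growth.get)
    return largest, fastest, growth


# Hypothetical counts for three clusters over a 10-second window.
previous = {"x": 10, "y": 50}
current = {"x": 40, "y": 55, "z": 5}
largest, fastest, growth = top_clusters(previous, current, window_seconds=10, n=2)
```

Using `heapq.nlargest` avoids sorting every cluster when only the top few are reported, which matters when the number of live clusters is large.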

Research Team

Dr. Oisín Boydell, UCD School of Computer Science and Informatics
Prof. Pádraig Cunningham, UCD School of Computer Science and Informatics
Dr. Marek Landowski, UCD School of Computer Science and Informatics
Dr. Guangyu Wu, UCD School of Computer Science and Informatics

For more information, please visit: https://licence.ucd.ie/tech/504
