Using Apache Hadoop for Fraud Detection and Prevention

Fraud has multiple meanings, and the term is easily abused.  The definition of fraud has changed repeatedly over the years and is as elusive as fraud itself.  The modern legal definition usually contains several elements that have to be proven in court, and these vary by state and country.  For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.  A more general definition may contain up to 9 elements.

From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.

Both definitions emphasize that the event is rare (assuming that most of the population consists of law-abiding citizens) and intentional (there is no “accidental” fraud), and both imply significant damage to the defrauded party (otherwise why bother).  Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, so it is difficult to build a predictive model, and (b) fraud has a real human being behind it and incorporates elements of game theory, since the fraudster is often an insider who knows how to game the system.

Fraud and Rare Events

By definition, fraud is an unexpected or rare event that causes significant financial or other damage.  The fraudster usually has some prior information about how the current system works, including previous successful and unsuccessful fraud cases and possibly the fraud detection mechanisms themselves.  This adaptive behavior breaks a standard statistical modeling assumption, namely that observations are independent and identically distributed (i.i.d.), which makes building a reliable statistical model difficult.  Often the fraudster works in the same industry that the fraud detection is supposed to protect, is intimately familiar with the fraud detection methods, and is actively trying to avoid detection by masquerading as a legitimate user.

The rare event detection problem also applies to online advertising and marketing, particularly to predicting “long tail” events, and to terrorism detection.

One common example of fraud is associated with the Taleb distribution, where a seemingly high probability of a small gain masks a small probability of a large loss that more than outweighs the gains.  Relatively long periods of slightly better than moderate gains are interrupted by rare events with large losses.  It is easy to defraud investors by presenting a partial analysis that excludes the “rare events”.
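To make the arithmetic concrete, here is a minimal Python sketch of such a payoff profile.  The probabilities and payoffs are invented for illustration and do not come from any real strategy; the point is only that the full expected return is negative, while the “partial analysis” that drops the rare event looks profitable.

```python
# Hypothetical payoff profile; all numbers are illustrative assumptions.
P_GAIN, GAIN = 0.99, 1.0      # 99% of periods: gain 1 unit
P_LOSS, LOSS = 0.01, -150.0   # 1% of periods: lose 150 units


def expected_return(outcomes):
    """Expected value of a list of (probability, payoff) pairs."""
    return sum(p * x for p, x in outcomes)


# Full analysis: includes the rare large loss.
full = expected_return([(P_GAIN, GAIN), (P_LOSS, LOSS)])

# Partial analysis: the rare event is excluded, so every period is a gain.
partial = expected_return([(1.0, GAIN)])

print(f"full analysis:    {full:+.2f} per period")
print(f"partial analysis: {partial:+.2f} per period")
```

With these assumed numbers the strategy loses about half a unit per period on average, yet a report covering only the “normal” periods shows a steady gain of one unit.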

Fraud Prevention

Since fraud is so hard to prove in court, most organizations and individuals try to prevent it with blanket measures.  These include limiting the amount of damage a fraudster can inflict on the organization as well as early detection of fraud patterns.  For example, credit card companies can cut credit limits across the board in anticipation of a few fraud cases.  Advertisers can block advertising campaigns with a low number of qualifying events.  And anti-terrorism agencies can prevent people with bottles of pure water from boarding planes.  These actions often conflict with a company’s efforts to attract more customers and result in general dissatisfaction.  To the rescue come technologies like Hadoop, influence diagrams, and Bayesian networks.  Inference in such models is computationally expensive (exact inference in Bayesian networks is NP-hard in general), but they are more accurate and predictive.

Why Hadoop?

Apache Hadoop is a distributed system for processing large amounts of data.  At the recent Hadoop Summit 2010, Yahoo, Facebook, and other companies announced that they currently process a few TBs of data per day and that the volumes are growing at exponential rates.  Hadoop can be vital for solving the fraud detection problem because:

  1. Sampling does not work for rare events since the chance of missing a positive fraud case leads to significant deterioration of model quality.
  2. Hadoop can solve much harder problems by leveraging multiple cores across thousands of machines and searching through much larger problem domains.
  3. Hadoop can be combined with other tools to manage moderate to low response latency requirements.

Let’s go through these reasons one by one.  Sampling is a common technique for modeling rare events.  One of the problems with sampling is that we cannot afford to throw away rare positive cases.  Even in a stratified or proportional sampling scheme, one has to retain all positive cases, since the model accuracy heavily depends on them (one can usually discard some negative cases, though).  As a result, the system still has to pass over the whole dataset to separate the positive cases from the negative ones.
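The sampling scheme described above can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the method of any particular product: the function name `downsample_negatives`, the record layout, and the `keep_rate` parameter are all hypothetical.  Every positive case is kept, negatives are kept with a fixed probability, and each kept negative carries a weight so that class proportions can be restored later.

```python
import random


def downsample_negatives(records, is_fraud, keep_rate, seed=0):
    """Keep every fraud record; keep each non-fraud record with probability keep_rate.

    Returns (record, weight) pairs.  Kept negatives get weight 1 / keep_rate
    so that a weighted model sees roughly the original class balance.
    """
    rng = random.Random(seed)
    sample = []
    for rec in records:  # a full pass over the data is still required
        if is_fraud(rec):
            sample.append((rec, 1.0))
        elif rng.random() < keep_rate:
            sample.append((rec, 1.0 / keep_rate))
    return sample


# Toy data: 1,000 transactions, 5 of them labeled as fraud (hypothetical labels).
transactions = [{"id": i, "fraud": i % 200 == 0} for i in range(1000)]
sample = downsample_negatives(transactions, lambda r: r["fraud"], keep_rate=0.1)

kept_fraud = sum(1 for rec, _ in sample if rec["fraud"])
print(f"kept {len(sample)} of {len(transactions)} records, {kept_fraud} fraud cases")
```

Note that the loop still touches every record, which is exactly why the full dataset has to be scanned even when only a sample is modeled.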

Hadoop is known for its number-crunching power.  Nothing can compare with the throughput of thousands of machines, each of which has multiple cores.  As was reported recently at the Hadoop Summit 2010, the largest installations of Hadoop have 2,000 to 4,000 computers with 8 to 12 cores each, amounting to up to 48,000 active threads looking for a pattern at the same time.  This allows either (a) looking through longer periods of time to incorporate events across a larger time frame or (b) taking more sources of information into account.  It is quite common among social network companies to comb through Twitter and blog posts in search of relevant data.

Finally, one of the fraud prevention problems is latency.  The agencies want to react to an event as soon as possible, often within a few minutes of the event.  Yahoo recently reported that it can adjust its behavioral model in response to a user click event within 5-7 minutes across several hundred million customers and billions of events per day.  Cloudera has developed a tool, Flume, that can load billions of events into HDFS within a few seconds so that they can be analyzed using MapReduce.

Often fraud detection is akin to “finding a needle in a haystack”.  One has to go through mountains of relevant and seemingly irrelevant information, build dependency models, evaluate the impact, and thwart the fraudster’s actions.  Hadoop helps with finding patterns by processing mountains of information on thousands of cores in a relatively short amount of time.

Where to look next?

Techniques for fraud detection are industry-specific as a rule and are often kept confidential, since they obviously represent valuable information for potential fraudsters.  Moreover, fraud detection techniques are usually a moving target, since fraudsters quickly adjust to new detection mechanisms.

One of the most publicized technical frauds is click fraud in online advertising.  Advertisers are often charged on a per-click basis, in so-called PPC (pay-per-click) campaigns, so the traffic provider, such as a search web site, has a clear incentive to inflate the number of clicks.  (Advertisers can also be charged on a per-conversion basis, which we will cover shortly, but a different type of fraud emerges there, in which the advertiser tries to conceal the conversions.)  Additionally, an advertiser’s competitor may be motivated to inflate the number of clicks to erode the original advertiser’s margin.  This can be achieved by a human or software agent that generates extra traffic and clicks on the competitor’s site.  Fraud management companies like Anchor Intelligence and Click Forensics estimate that approximately 20% to 30% of all clicks are fraudulent.  How do we know that a click is fraudulent?

Decline in the number of conversions: first and most important, if your return on an ad is normally positive (that is, you are making a profit on the ad) and it suddenly turns negative, this could be a sign of click fraud in action.  Click fraud causes extra clicks on your ad with no actual purchases, so your conversion rate falls accordingly.

An abnormal number of clicks from the same IP address or a pattern in the access times: although this is the most obvious and easily identified form of click fraud, it is amazing how many fraudsters still use this method, particularly for quick attacks.  They may choose to strike over a long weekend, when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday, your account is significantly depleted.  Some of these repeated clicks may be unintentional, for example when a user reloads a page.

A large “abandonment rate”, that is, a large number of visitors who leave your site quickly: another indication of click fraud can be a pattern of visitors clicking on your ad, spending the minimum amount of time on your site required by your PPC search engine to establish a valid click (usually 30 seconds or more), and then leaving without ever going beyond the landing page.

A large number of impressions without follow-through clicks on your ad: if you notice many more impressions (views) than clicks, this could indicate impression fraud.  Artificial inflation of your ad impressions may cause your clickthrough rate to drop below the Google minimum, and your ad will be disabled.  Until you realize this, your competitors have free rein to use your keywords, sometimes at bargain prices.  As well, your relevancy ratings with search engines may drop as they record numerous impressions but no interest shown via visits to other parts of your website, which could lead to a shutdown of your campaign.

Abnormally high clicks and impressions on affiliate websites: although affiliates are sometimes involved in conducting click fraud schemes, they can also be victims of click fraud.  If a competitor directs excessive clicks and impressions at an affiliate’s site, the PPC search engine will soon notice an abnormally high payment to that affiliate and may go as far as canceling the affiliate’s account, even though the affiliate was not engaging in any form of click fraud.

A large number of clicks coming from countries outside of your normal market area: using IP geo-location services, you can identify which country an IP address is probably coming from.
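Two of the indicators above, repeated clicks from one IP address and a falling conversion rate, are simple enough to sketch in Python.  This is only an illustration: the click-log fields (`ip`, `converted`), the function names, and the threshold of 3 clicks per IP are assumptions made for the example, and a real system would tune such thresholds per campaign.

```python
from collections import Counter


def suspicious_ips(clicks, max_clicks_per_ip=3):
    """Return the set of IP addresses with more than max_clicks_per_ip clicks."""
    counts = Counter(c["ip"] for c in clicks)
    return {ip for ip, n in counts.items() if n > max_clicks_per_ip}


def conversion_rate(clicks):
    """Fraction of clicks that led to a conversion (0.0 for an empty log)."""
    if not clicks:
        return 0.0
    return sum(1 for c in clicks if c["converted"]) / len(clicks)


# Hypothetical click log: one address clicks 8 times and never converts,
# while 4 other visitors click once each and half of them convert.
clicks = (
    [{"ip": "203.0.113.7", "converted": False} for _ in range(8)]
    + [{"ip": f"198.51.100.{i}", "converted": i % 2 == 0} for i in range(4)]
)

flagged = suspicious_ips(clicks)
rate_all = conversion_rate(clicks)
rate_clean = conversion_rate([c for c in clicks if c["ip"] not in flagged])

print("flagged IPs:", flagged)
print(f"conversion rate with flagged clicks:    {rate_all:.2f}")
print(f"conversion rate without flagged clicks: {rate_clean:.2f}")
```

The gap between the two conversion rates shows how a burst of non-converting clicks from a single address drags down the overall figure.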

In the case of performance-based advertising, the advertiser is interested in concealing some of the traffic, not inflating it.  Since most performance-based measurement relies on beacons or pixels placed on the advertiser’s conversion page, the advertiser has an incentive to (temporarily) block traffic from the beacon or to remove it from the web site completely.

Fraud is also prevalent in the telecom industry.  One of the leading commercially available fraud detection products is the HP FMS system, on which the author had the pleasure of working personally.  The types of telecom fraud include:

Subscription fraud: the acquisition of telecommunications services using stolen or false credentials and/or identity with no intention of paying.  With subscription fraud, not only do service providers lose revenue, but individual consumers are also vulnerable to having their identity stolen and their credit rating tarnished.

Technical/network fraud: occurs when someone uses equipment or technology to gain access to a service without paying.  Fraudulent calls are typically billed to the legitimate owner of the line or service.  Wireless examples include cloning of cell phones or subscriber identity module (SIM) cards.  Fixed line examples include clip-on or line tapping, private branch exchange (PBX) hacking, and calling card fraud.  Prepaid services also have a large exposure to fraud through terminal tampering via magnetic strips or SIM chips, or recharging with stolen credit card numbers.

Insider fraud: occurs when individuals inside the operator provide fraudulent access to networks or otherwise thwart the ability of the operator to be paid for services used.

Handset abuse: occurs when stolen or lost handsets are used to consume telecommunications services that are in turn paid for by the service provider.  This is an expensive liability for carriers, who absorb the costs.

Social engineering: an effective fraud technique in which people unwittingly help perpetrators by providing sensitive data or illicit access, or by simply forwarding their calls, without ever knowing they have done anything wrong.

All of these patterns can be detected with specialized MapReduce pattern detection techniques, and Flume offers low-latency stream processing capabilities.
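As one possible illustration of such a technique, the Python sketch below mimics the map, shuffle, and reduce phases in memory to flag a symptom of phone or SIM cloning: the same subscriber making overlapping calls from different cells.  It is an assumption-laden toy, not Hadoop code and not how any particular fraud product works.  The record fields (`subscriber`, `start`, `end`, `cell`) and the functions `map_call`, `reduce_calls`, and `run` are hypothetical; on a real cluster the same logic would live in a mapper and a reducer.

```python
from collections import defaultdict
from itertools import combinations


def map_call(record):
    """Map step: key each call record by subscriber ID."""
    yield record["subscriber"], (record["start"], record["end"], record["cell"])


def reduce_calls(subscriber, calls):
    """Reduce step: emit the subscriber if any two calls overlap in time
    while originating from different cells (a possible sign of cloning)."""
    # After sorting by start time, two calls overlap iff the later one
    # starts before the earlier one ends.  Pairwise comparison is
    # quadratic per subscriber, which is acceptable for a sketch.
    for (s1, e1, c1), (s2, e2, c2) in combinations(sorted(calls), 2):
        if s2 < e1 and c1 != c2:
            yield subscriber
            return


def run(records):
    """In-memory stand-in for the shuffle phase between map and reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_call(rec):
            groups[key].append(value)
    return sorted(s for key, vals in groups.items() for s in reduce_calls(key, vals))


# Hypothetical call records; times are seconds from an arbitrary origin.
records = [
    {"subscriber": "A", "start": 0, "end": 100, "cell": "SF"},
    {"subscriber": "A", "start": 50, "end": 120, "cell": "NYC"},
    {"subscriber": "B", "start": 0, "end": 60, "cell": "LA"},
    {"subscriber": "B", "start": 60, "end": 90, "cell": "SD"},
]
suspects = run(records)
print("subscribers with overlapping calls from different cells:", suspects)
```

Keying by subscriber ID is the design choice that matters here: it guarantees that all calls for one subscriber reach the same reducer, so the overlap check never needs data from another machine.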

Needless to say, fraudsters also explore the potential market and invent new ways to commit fraud.  One example is Click Monkeys, which claims to operate a vessel with animals off the coast of California to generate seemingly random traffic.

5 Responses
  • Mark Kerzner / August 25, 2010 / 7:11 AM

    Alex, I like your cool, mathematical style. Also, very interesting and useful article.

  • Mark Kerzner / August 25, 2010 / 7:37 AM

    So the main point is “Cloudera has developed a tool, Flume, that can load billions of events into HDFS within a few seconds and analyze them using MapReduce.”?

    And the suggestion to use ALL logs?

    Or is there anything deeper that I am missing?

    Thank you.
