How Apache Hadoop Helps Scan the Internet for Security Risks

The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.

Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?

Who is responsible for the generally weak condition of website security today? It can’t be website operators, because there’s no prerequisite to know about blind SQL injection attacks or validation filters before spinning up a website. It can’t be website developers either; we definitely don’t equip them to evaluate website security for themselves. The community that focuses on both web development and web security is pretty small, and that community is pretty opaque.

I decided to change that dynamic by creating the open source PunkSPIDER project. PunkSPIDER gives users the ability to evaluate website security on their own, and via the most familiar medium possible: a search engine. Specifically, PunkSPIDER scans the entire Internet for the most basic web vulnerabilities: blind SQL injection (bsqli), SQL injection (sqli), and cross-site scripting (xss). It indexes the results for searchability and then publishes all of this information in the open, for free.

Sound crazy? Sound hard? Sound expensive? Well, that’s where Apache Hadoop comes in — I never would have gotten the PunkSPIDER project off the ground without Hadoop. Hadoop is helping me create something literally as big as the Internet, with virtually no money and some old hardware.

We’re actually using Hadoop in a fairly unusual way. Sure, we do data analytics too, and the end goal is to provide rolled-up data to the end user, but Hadoop is flexible and powerful enough to do more than that.

At its core, PunkSPIDER functions as your standard web spider, much like the one Google uses. It uses Apache Nutch to spider the Internet, collect domains, and keep this index updated. Nutch runs on top of a Hadoop cluster and provides out-of-the-box functionality to perform extremely quick crawls using MapReduce jobs. This spider is left running indefinitely, constantly updating the index and collecting new domains.
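To make the "collect domains" step concrete, here is a minimal Python sketch of the map and reduce shape such a job might take. It is not Nutch's code or PunkSPIDER's; the function names (`map_url_to_domain`, `reduce_domain_counts`, `collect_domains`) are made up for illustration, and the whole thing runs in a single process rather than on a cluster.

```python
from collections import Counter
from urllib.parse import urlsplit


def map_url_to_domain(url):
    """Map step: emit (domain, 1) for a crawled URL, or nothing if it has no host."""
    host = urlsplit(url.strip()).hostname
    if host:
        yield host.lower(), 1


def reduce_domain_counts(pairs):
    """Reduce step: sum the counts emitted for each domain."""
    counts = Counter()
    for domain, n in pairs:
        counts[domain] += n
    return dict(counts)


def collect_domains(urls):
    """Run the map step over every URL, then reduce (local stand-in for a cluster job)."""
    pairs = (pair for url in urls for pair in map_url_to_domain(url))
    return reduce_domain_counts(pairs)
```

On a real cluster the map calls would be spread across nodes and the framework would group the pairs by domain before the reduce step; the sketch only shows the data flow.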

But the coolest part is what happens after PunkSPIDER has some domains in its index: From there, it moves on to searching for vulnerabilities in the indexed domains. Why is this cool? Well, web application vulnerability scanning is a fairly memory- and CPU-intensive process. Typical scanners can be unstable, they often get caught in infinite loops, and it can take a really long time to scan a single domain. Almost all of them only work on one website at a time and provide very little automation. (This is not to disparage other scanners out there – they simply have a different purpose. But for the millions of domains that we currently have in our index, and the hundreds of millions that we expect in the future, this was simply not going to work.)

So, I decided I needed to build my own scanner, called PunkSCAN, to be:

  • Extremely stable – If a single scan fails, the entire job should continue gracefully.
  • Extremely fast – Because the Internet is big.
  • Built for massive scans – Again, because the Internet is big.
  • Extremely cheap – Open source projects aren’t get-rich-quick schemes.

With Hadoop, I was able to solve every one of the issues above. PunkSCAN essentially grabs a batch of domains from our index, scans them in parallel, and returns results, indexing them as metadata on a particular domain. Everything in PunkSCAN is a Hadoop MapReduce job.
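As a rough illustration of the "grab a batch, scan, return results as metadata" flow, the sketch below shows what a map step could look like in Python. It assumes a Hadoop Streaming-style convention (one domain per input line, one tab-separated key and value per output line), which the post does not confirm PunkSCAN uses. `scan_domain` is a hypothetical placeholder, not the real scanner, and always reports zero findings.

```python
import json


def scan_domain(domain):
    """Hypothetical stand-in for a real scanner; returns a finding count per check."""
    if not domain or " " in domain:
        raise ValueError("not a scannable domain: %r" % domain)
    # Placeholder result: a real scanner would send requests and inspect responses.
    return {"bsqli": 0, "sqli": 0, "xss": 0}


def map_scan(lines, scan=scan_domain):
    """Map step: one domain per input line, one 'domain<TAB>json' line per domain."""
    for line in lines:
        domain = line.strip()
        if not domain:
            continue  # skip blank input lines
        try:
            record = {"status": "ok", "findings": scan(domain)}
        except Exception as exc:
            # Isolate the failure so one bad scan does not stop the batch.
            record = {"status": "failed", "error": str(exc)}
        yield "%s\t%s" % (domain, json.dumps(record, sort_keys=True))
```

Under Hadoop Streaming, a wrapper script would feed `sys.stdin` to `map_scan` and print each output line. Catching the exception per domain is what lets a single failed scan show up as a "failed" record instead of taking the whole job down.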

That means Hadoop takes care of a lot of the hard stuff. For example, if a task fails at any time or takes too long, it’s retried on another node in the cluster. It also scales horizontally: the more machines we add to the cluster, the faster the scan goes. We were also able to build it on almost no budget; our machines are literally donated old laptops and desktops. If one of the machines dies, no big deal: Hadoop reschedules its work on another node. No data is lost, and everything continues forward beautifully. The final step was to make all of this information easily searchable in the PunkSPIDER front-end, where users can search for specific websites of interest to them.
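The retry behavior described above can be pictured with a toy, single-process simulation. This is not how Hadoop implements rescheduling, only the idea behind it: if an attempt fails on one node, try the same work on another. The names `run_with_retries` and `flaky_scan`, the node names, and the default of three attempts are all assumptions made for the example.

```python
def run_with_retries(task, item, nodes, max_attempts=3):
    """Try task(item, node) on successive nodes until one attempt succeeds.

    Returns (result, node_used). Raises RuntimeError if every attempt fails.
    """
    errors = []
    for attempt in range(max_attempts):
        # Rotate through the node list so a retry lands on a different machine.
        node = nodes[attempt % len(nodes)]
        try:
            return task(item, node), node
        except Exception as exc:
            errors.append("%s: %s" % (node, exc))
    raise RuntimeError(
        "all %d attempts failed: %s" % (max_attempts, "; ".join(errors))
    )


def flaky_scan(domain, node):
    """Hypothetical task: pretend the node named 'old-laptop' has died."""
    if node == "old-laptop":
        raise ConnectionError("node unreachable")
    return "scanned %s" % domain
```

For example, `run_with_retries(flaky_scan, "example.com", ["old-laptop", "desktop-1"])` fails on the first node and succeeds on the second. In real Hadoop the scheduler makes this decision, and the number of attempts per task is configurable.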

And there you have it, an awesomely elegant and simple solution provided by our friendly neighborhood elephant.

We’ve received a ton of positive feedback and most people are excited about the project, but it has understandably sparked a healthy debate. Some people have asked, “Aren’t you just giving script kiddies a gold mine of information for breaking into websites?” Well, we’re not giving malicious actors any new information that they can’t (or don’t already) get on their own. But we are giving average site owners and users access to this information, which they don’t have. Let’s face it, website vulnerabilities are rampant, and site owners and users aren’t equipped to do much about it. But we’re hoping that this project changes that by raising awareness about where the vulnerabilities are – because in website security, ignorance is the opposite of bliss.

Since PunkSPIDER was released last week, we’ve received several requests for scans. We’ve also seen some community interest in building PunkSPIDER-based browser plug-ins to alert users when they visit a vulnerable site, which is exactly what we hoped to see happen.

Hadoop technology and the incredible open-source support community around it allowed me to take my big idea and fit it into the small space of my spare bedroom, to build something new and innovative and powerful that may have a positive impact on the entire World Wide Web. And if you ask me, that is pretty cool.
