For the last few months, we’ve been working with the TVA to help them manage hundreds of TB of data from America’s power grids. As the Obama administration investigates ways to improve our energy infrastructure, the TVA is doing everything they can to keep up with the volumes of data generated by the “smart grid.” But as you know, storing that data is only half the battle. In this guest blog post, the TVA’s Josh Patterson goes into detail about how Hadoop enables them to conduct deeper analysis over larger data sets at considerably lower costs than existing solutions. -Christophe
The Smart Grid
At the Tennessee Valley Authority (TVA) we collect phasor measurement unit (PMU) data on behalf of the North American Electric Reliability Corporation (NERC) to help ensure the reliability of the bulk power system in North America. TVA is a federally owned corporation in the United States, created by congressional charter in May 1933 to provide flood control, electricity generation, and economic development in the Tennessee Valley. NERC is a self-regulatory organization, subject to oversight by the U.S. Federal Energy Regulatory Commission and governmental authorities in Canada. TVA has been selected by NERC as the repository for PMU data nationwide. PMU data is considered part of the measurement data for the generation and transmission portion of the so-called smart grid.
PMU Data Collection
There are currently 103 active PMU devices placed around the Eastern United States that send data to TVA, and new PMU devices come online regularly. PMU devices sample high-voltage electric system busses and transmission lines at a substation several thousand times a second; the results are then reported for collection and aggregation. PMU data is a GPS time-stamped stream of those power grid measurements, transmitted 30 times a second, with each measurement consisting of a timestamp and a floating point value. The types of information a PMU point can contain are:
- Voltage (A, B, C phase in positive, negative, or zero sequence) magnitude and angle
- Current (A, B, C phase in positive, negative, or zero sequence) magnitude and angle
- dF/dt (change in frequency over time)
- Status flags
Commonly, just positive sequence voltages and currents are transmitted, but it is possible to send all three phases. There can be several measured voltage and current phasors per PMU (each phasor having a magnitude and an angle value), a variable number of digitals (typically 1 or 2), and one of each of the remaining 3 types of data; on average there will be around 16 total measurements sent per PMU. Should a company wish to send all three phases or a combination of positive, negative, or zero sequence data, the number of measurements increases accordingly.
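To put those figures in perspective, here is a rough back-of-the-envelope sketch of the incoming point rate. It uses only the numbers quoted in this post (103 PMUs, an average of 16 measurements each, 30 samples a second); the function names are ours and purely illustrative, and the real rate varies with the number of measurements each PMU actually sends.

```python
# Figures quoted in the post; 16 measurements per PMU is an average, not a fixed value.
PMU_COUNT = 103
MEASUREMENTS_PER_PMU = 16
SAMPLES_PER_SECOND = 30
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_second(pmus=PMU_COUNT,
                      measurements=MEASUREMENTS_PER_PMU,
                      rate=SAMPLES_PER_SECOND):
    """Approximate number of (timestamp, value) points arriving each second."""
    return pmus * measurements * rate


def points_per_day(pmus=PMU_COUNT,
                   measurements=MEASUREMENTS_PER_PMU,
                   rate=SAMPLES_PER_SECOND):
    """Approximate number of points arriving in one 24-hour day."""
    return points_per_second(pmus, measurements, rate) * SECONDS_PER_DAY


print(points_per_second())  # 49440
print(points_per_day())     # 4271616000
```

That is roughly 49 thousand points a second, or about 4.3 billion points a day, before any new PMUs come online.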
The amount of time-series data created by even a regional set of PMU devices places a unique architectural demand on the TVA infrastructure. The flow of data from measurement device to TVA is as follows:
- A measurement device located at the substation (the PMU) samples various data values, timestamps them via a GPS clock, and sends them over fiber or other suitable lines to a central location.
- For some participant companies this may be a local concentrator or it may be a direct connection to TVA itself. Communication between TVA and these participants is commonly a VPN tunnel over a LAN-to-LAN connection, but several partners use an MPLS connection for more remote regions.
- After a few network hops the data is sent to a TVA-developed data concentrator termed the Super Phasor Data Concentrator (or SPDC), which accepts input from these PMUs and orders it into the correct time-aligned sequence, compensating for any missing data or delay introduced by network congestion or latency.
- Once the data is organized, the SPDC's modular architecture allows it to be operated on by third-party algorithms via a simple plug-in layer.
- The entire stream is passed to one of three servers running an archiving application, which writes the data to disk in a size-optimized, fixed-length binary file. The stream currently involves 19 companies, 10 different manufacturers of PMU devices, and 103 PMUs, each reporting an average of 16 measured values at a rate of 30 samples a second, with a possibility of 9 different encodings (and this only from the Eastern United States).
- A real-time data stream is simultaneously forwarded to a server program hosted by TVA which passes the conditioned data in a standard phasor data protocol (IEEE C37.118-2005) to client visualization tools for use at participant companies.
- An agent moves PMU archive files into the Hadoop cluster via an FTP interface.
- Alternatively, regulators such as NERC or approved researchers can directly request this data over secure VPN tunnels for operation at their remote location.
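The time-alignment step in the flow above can be illustrated with a small sketch. This is not the SPDC's actual algorithm: it is a minimal batch example, assuming samples arrive as hypothetical (pmu_id, timestamp, value) tuples in arbitrary order and that timestamps are integer frame numbers (one tick per 1/30 second). A real concentrator works on a live stream and has to decide how long to wait for late data.

```python
from collections import defaultdict


def time_align(samples, pmu_ids):
    """Group (pmu_id, timestamp, value) samples into time-ordered frames.

    Returns a list of (timestamp, {pmu_id: value_or_None}) tuples sorted by
    timestamp. None marks a PMU whose sample for that timestamp never arrived.
    """
    by_time = defaultdict(dict)
    for pmu_id, timestamp, value in samples:
        by_time[timestamp][pmu_id] = value

    frames = []
    for timestamp in sorted(by_time):
        row = by_time[timestamp]
        # Fill in every expected PMU so each frame has the same shape.
        frames.append((timestamp, {pmu: row.get(pmu) for pmu in pmu_ids}))
    return frames


# Samples arrive out of order, and PMU "B" never reports for frame 2.
arrivals = [
    ("B", 1, 60.01),
    ("A", 0, 59.98),
    ("A", 1, 60.00),
    ("B", 0, 59.99),
    ("A", 2, 60.02),
]
for frame in time_align(arrivals, ["A", "B"]):
    print(frame)
```

Marking gaps explicitly, rather than dropping the frame, keeps every frame the same width, which fits naturally with a fixed-length archive record.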
TVA currently has around 1.5 trillion points of time-series data in 15TB of PMU archive files. The rate of incoming PMU data is growing very quickly, with more and more PMU devices coming online regularly. We expect to have around 40TB of PMU data by the end of 2010, with 5 years' worth of PMU data estimated at half a petabyte (500TB).
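As a rough check on those figures, the archive size and point count imply an average storage cost per point. The sketch below assumes decimal terabytes (the post does not say which unit is meant) and treats the result as an average that includes any file overhead; the helper names are illustrative.

```python
TB = 10**12  # decimal terabyte; an assumption, since TB vs. TiB is not specified


def bytes_per_point(total_bytes, total_points):
    """Average archive bytes consumed per stored point."""
    return total_bytes / total_points


def points_for(total_bytes, per_point_bytes):
    """Number of points that fit in total_bytes at the given average cost."""
    return total_bytes / per_point_bytes


per_point = bytes_per_point(15 * TB, 1_500_000_000_000)
print(per_point)                        # 10.0
print(points_for(40 * TB, per_point))   # about 4 trillion points
print(points_for(500 * TB, per_point))  # about 50 trillion points
```

At roughly 10 bytes per point, the projected 500TB archive would hold on the order of 50 trillion points, assuming the per-point cost stays about the same.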
The Case For Hadoop At TVA
Our initial problem was how to reliably store PMU data and make it available at all times. There are many brand name solutions in the storage world that come with a high price tag and the assumption of reliable hardware. With large amounts of data spanning many disks, even a system with a high mean time to failure (MTTF) will experience hardware failures quite frequently. We liked the idea of being able to lose whole physical machines and still have an operational file system thanks to Hadoop's aggressive replication scheme. The more we talked with other groups using HDFS, the more we came away with the impression that HDFS worked as advertised and shone even with amounts of data that the reliable hardware struggled with. Our discussions and findings also indicated that HDFS was quite good at moving data and included multiple ways to interface with it out of the box. In the end, Hadoop is a good fit for this project in that it allows us to employ commodity hardware and open source software at a fraction of the price of proprietary systems to achieve a much more manageable expenditure curve as our repository grows.
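The point about failures at scale is easy to see with a little arithmetic. The sketch below assumes a constant failure rate per drive, which is a simplification, and the drive count and MTTF are hypothetical numbers chosen for illustration, not a description of our cluster. The replication factor of 3 is the HDFS default.

```python
HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(disk_count, mttf_hours):
    """Expected drive failures per year, assuming a constant failure rate."""
    return disk_count * HOURS_PER_YEAR / mttf_hours


def raw_capacity_tb(data_tb, replication=3):
    """Raw disk needed when every block is stored `replication` times."""
    return data_tb * replication


# Hypothetical: 1,000 drives, each rated at a 1,000,000-hour MTTF.
print(expected_failures_per_year(1000, 1_000_000))  # about 8.76 failures a year
print(raw_capacity_tb(500))                         # 1500 TB raw for 500 TB of data
```

Even with drives rated for a million hours, a thousand of them would be expected to lose one every six weeks or so, which is why a file system that tolerates failures matters more than any single piece of reliable hardware. The cost is raw capacity: three copies of everything.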
The other side of the equation is that eventually NERC and its designated research institutions are to be able to access the data and run operations on it. The concept of moving computation to the data with map-reduce made Hadoop an even more attractive choice, especially given its price point. The proposed uses of our PMU data range from simple pattern scans to complex data mining operations. The types of analysis and algorithms that we want to run aren't well suited to SQL. It became obvious that we were more in the market for a batch processing system such as map-reduce than for a large relational database system. We were also impressed with the very robust open source ecosystem that Hadoop enjoys; many projects built on Hadoop are under active development.
This thriving community was very interesting to us because it gives TVA a wealth of quality tools with which to apply the analysis techniques needed to understand PMU data. After reviewing the factors above, we concluded that employing Hadoop at TVA kills two birds with one stone: it solves our storage issues with HDFS and provides a robust computing platform with map-reduce for researchers around North America.
PMU Data Analysis at TVA
Currently our analysis needs and wants are evolving along with our nascent ideas on how best to use PMU data. Current techniques and algorithms on the board or in beta include:
- Washington State's Oscillation Monitoring System
- Basic averages and standard deviation over frequency data
- Fast Fourier transform filters
- Indexing of power grid anomalies
- Various visualization rendering techniques such as creating power grid map tiles to watch the power grid over time and in history
We are currently writing map-reduce applications to crunch far greater amounts of power grid information than has previously been possible. Using traditional techniques to calculate something as simple as an average frequency over time can be an extremely tedious process because of the need to traverse terabytes of information; map-reduce allows us not only to parallelize the operation but also to get much higher disk read speeds by moving the computation to the data. As we evolve our analysis techniques, we plan to expand our range of indexing techniques from simple scans to more complex data mining techniques to better understand how the power grid reacts to fluctuations and how anomalies previously thought to be discrete may, in fact, be interconnected.
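To make the average-frequency example concrete, here is a minimal sketch of how a mean and standard deviation decompose into a map step and a reduce step. It runs in a single Python process rather than on Hadoop, and it is not our production code: each "chunk" stands in for the block of samples one mapper would read, and the function names are ours. The reduce step uses the standard pairwise merge of (count, mean, sum of squared deviations), which stays accurate for values clustered tightly around 60 Hz.

```python
import math
from functools import reduce


def map_chunk(values):
    """Map step: summarize one chunk of samples as (count, mean, M2)."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    return (n, mean, m2)


def combine(a, b):
    """Reduce step: merge two summaries without revisiting the raw samples."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def mean_and_std(chunks):
    """Return (mean, population standard deviation) across all chunks."""
    n, mean, m2 = reduce(combine, map(map_chunk, chunks), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no samples")
    return mean, math.sqrt(m2 / n)


# Three chunks of frequency samples (Hz), as three mappers might see them.
chunks = [[59.98, 60.00], [60.02], [60.01, 59.99]]
mean, std = mean_and_std(chunks)
print(round(mean, 6), round(std, 6))
```

Because each summary is tiny and the merge is associative, the heavy work (reading terabytes of samples) happens where the data lives, and only a few numbers per chunk travel over the network.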
Additionally, we are adding other devices such as Frequency Disturbance Recorders (FDRs, a.k.a. F-NET devices, which are developed by Virginia Tech) to our network. Although these devices send samples at a third of the rate of PMU devices with a reduced measurement set, there is the potential for many hundreds of these less expensive meters to come online, which would effectively double our storage requirements. This FDR data would be interesting in that it would allow us to create a more complete picture of the power grid and its behavior. Hadoop would allow us to continue scaling up to meet the extra demand, not only for storage but for processing with map-reduce as well. Hadoop gives us the flexibility and scalability to meet future demands placed upon the project with respect to data scale, processing complexity, and processing speed.
Looking Forward With Hadoop
As we move forward using Hadoop, there are a few areas we'd like to see improved. Security is a big deal in our field, especially given the nature of the data and agencies involved. We would like to see security continue to be improved by the Hadoop community as a whole as time goes on. Security internally and externally is a big part of what we do, so we are always examining our production environment to make sure we fulfill our requirements. We are also looking at ways to allow multiple research projects to coexist on the same system, such that they share the same infrastructure but can queue up their own jobs and download the results from their own private account area while only having access to the data that their project allows. Research can be a competitive business and we are looking for unique ways to allow researchers to work with the same types of data while feeling comfortable that their specific work remains private. Additionally, we are required to maintain the privacy of all the data providers: researchers will only be allowed to access a filtered set of measurements as allowed by the data providers or as deemed available for research by NERC.
In our first discussions about whether or not we would explore cloud computing as an option for processing our PMU data, we wanted to know if there was a Red Hat-like entity in the space that could answer questions and provide support for Hadoop. Cloudera has definitely stepped up to the plate to fulfill this role for Hadoop. Cloudera provides exceptional support in a very dynamic space, a space in which many companies have no experience and many consulting firms can provide no solid advice. Cloudera was quick to make sure that Hadoop was right for us and then provided extremely detailed answers to all of our questions and what-if scenarios. Their whole team was exceptionally adept at getting back to us on a myriad of details that would stymie most sales or front-line support teams. Cloudera's distribution for Hadoop and guidance on hardware acquisition helped save us money and get our evaluation of Hadoop off the ground in a very short amount of time.