Monday, March 3, 2008

Storing and Processing Multi-dimensional Scientific Datasets

Alan Sussman
University of Maryland

11am, March 5, 2008
A.V. Williams 3174

Abstract

Large datasets are playing an increasingly important role in many areas of scientific research. Such datasets can be obtained from various sources, including sensors on scientific instruments and simulations of physical phenomena. The datasets often consist of a very large number of records, and have an underlying multi-dimensional attribute space. Because of such characteristics, traditional relational database techniques are not adequate to efficiently support ad hoc queries into the data. We have therefore developed algorithms and designed systems to efficiently store and process these datasets in both tightly coupled parallel computer systems and more loosely coupled distributed computing environments.

I will mainly discuss the design of two systems, the Active Data Repository (ADR) and DataCutter, for managing large datasets in parallel and distributed environments, respectively. Each of these systems provides both a programming model and a runtime framework for implementing high performance data servers. These data servers provide efficient ad hoc query capabilities into very large multi-dimensional datasets. ADR is an object-oriented framework that can be customized to provide optimized storage and processing of disk-based datasets on a parallel machine or network of workstations. DataCutter is a component-based programming model and runtime system for building data intensive applications that can execute efficiently in a Grid distributed computing environment. I will present optimization techniques that enable both systems to achieve high performance in a wide range of application areas. I will also present performance results on real applications on various computing platforms to support that claim.

About the Speaker

Alan Sussman is an Associate Professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland, College Park. Working with students and other researchers at Maryland and other institutions, he has published over 80 conference and journal papers on various topics related to software tools for high performance parallel and distributed (Grid) computing, and has contributed chapters to 6 books. Software tools he has built have been widely distributed and used in many computational science applications, in areas such as earth science, space science, and medical informatics. He received his Ph.D. in computer science from Carnegie Mellon University and his B.S.E. in Electrical Engineering and Computer Science from Princeton University.

9 comments:

chang said...

The talk is about building distributed data processing systems. It gave two examples, the Active Data Repository and DataCutter, both of which were built by the speaker. Those examples showed that the idea of mapping and aggregating has been around for a while, and gave us a better view of the background and some behind-the-scenes insight into MapReduce-style systems, be it Google's or Hadoop's. The speaker analyzed the two examples, their designs, and simulated experiments on different design choices.

Those systems are built upon the assumptions that 1) the data set is large, but we want to process just part of it, and 2) we know what the output will look like, and have a pretty good idea of which part of the process is the bottleneck. However, I am curious about how well these assumptions hold in practice. First of all, not all data sets are large, yet some can still be time-consuming to process. For example, if we want to process blog entries with some NLP technique, those entries could form a relatively small data set that nonetheless takes the computer a very long time to process. Secondly, I am not sure the person doing the data processing always has a good estimate of which step is the slowest. Even if he/she does, the nature of DataCutter (the fact that it spreads across various commodity machines) is likely to add uncertainty to this estimate.

That said, the two systems are still very useful in practice if we are not obsessed with optimal performance. The talk itself is very inspiring. I particularly like the slide that overlays the map on the data grid; it conveys the idea of MapReduce very effectively.

Alexander Mont said...

I think this is a particularly interesting talk that demonstrates a similar, but different approach to the problem of processing large data sets. Here is a comparison of features of MapReduce, ADR, and DataCutter:

1. Types of User-Specified Processing

MapReduce: The user specifies a "map" and a "reduce" operation. Each "map" produces any number of key-value pairs from a given input key-value pair, and each "reduce" takes any number of inputs for a key and produces a result. The user may, of course, perform multiple MapReduce passes if desired. (A toy sketch of this model, side by side with ADR's, appears after the comparison below.)

ADR: Very similar to MapReduce, except that the user must also pre-specify a set of "result" points (similar to keys in MapReduce) and each input point is mapped onto exactly one of those result points.

DataCutter: This one seems to have the fewest restrictions on processing. The user specifies a "dataflow" model - that is, a model specifying which parts of the process must be completed before others can start - and the system automatically handles moving the data between nodes and allocating nodes to the stages that involve the most processing.

Comparison: MapReduce and DataCutter seem to be more general, and can be used on just about any kind of data set. ADR seems to be designed more specifically for spatial data sets, where each data element corresponds to a point in a multidimensional space, so that when you map one image onto another, you know where each pixel in the resulting image is likely to end up, and you know each input data point corresponds to one output data point.
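
To make the contrast concrete, here is a toy, purely in-memory sketch of the two processing models as I understood them. The records, the grid size, and the function names are my own inventions for illustration, not the actual MapReduce or ADR APIs.

```python
# Minimal in-memory sketch contrasting the two processing models described
# above. Names (records, grid size, etc.) are invented for illustration; this
# is not the actual MapReduce or ADR API.
from collections import defaultdict

# --- MapReduce-style: map emits arbitrary (key, value) pairs, a shuffle
# --- groups them by key, and reduce aggregates each group.
def map_fn(record):
    # emit one (key, value) pair per word; a map may emit any number of pairs
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

def mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)          # the "shuffle"
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

# --- ADR-style: the set of result points (here, a small 2-D grid) is declared
# --- up front, and every input point is projected onto exactly one of them,
# --- where it is folded into an accumulator.
def adr_query(input_points, grid_shape):
    rows, cols = grid_shape
    accumulator = [[0.0] * cols for _ in range(rows)]   # pre-allocated output
    for x, y, value in input_points:
        r, c = int(x) % rows, int(y) % cols              # project to one cell
        accumulator[r][c] += value                       # aggregate in place
    return accumulator

if __name__ == "__main__":
    print(mapreduce(["a b a", "b c"]))                   # {'a': 2, 'b': 2, 'c': 1}
    print(adr_query([(0.2, 1.7, 5.0), (2.1, 1.3, 2.0)], (2, 2)))
```

The point of the ADR half is simply that the output (the accumulator grid) is allocated before any input is read, which is what lets the system plan storage and placement around it.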

----

2. Mechanisms for load-balancing, fault tolerance etc.

MapReduce: There is a single master node that keeps track of the state of the operation and farms tasks out to the other nodes. All nodes are always doing something, because if a node finishes early, it duplicates work still in progress on other nodes in case it can finish faster. The master can tell when faults have occurred because a failed node will not report back.

ADR: There are several different ways of farming tasks out to nodes, such as grouping work by input data location, by output data location, or by deciding dynamically. (The first two strategies are sketched after the comparison below.)

DataCutter: Machines are assigned to each of the tasks based on which task needs more machines. However, it was not made clear how the system determines which data elements to assign to which machines.

Comparison: The ADR seemed to work pretty well, although we don't know how well it compares to MapReduce. The DataCutter is the most interesting due to its apparent generality in application. One thing I don't really understand about the DataCutter is how well it handles situations where multiple data elements must be grouped together to process them, for example to add up counts. If DataCutter could solve these problems, then it could potentially be a tool as good as or better than MapReduce.
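
To make the scheduling choices in item 2 concrete, here is an equally hedged toy sketch of the input-driven versus output-driven assignment strategies mentioned above. The node, chunk, and tile names are invented; real ADR query planning is surely more involved.

```python
# Toy illustration of two of the work-partitioning strategies attributed to
# ADR in the talk. Everything here (nodes, chunks, tiles) is made up.
from collections import defaultdict

# each input chunk lives on some node; each output tile is owned by some node
input_location  = {"chunk0": "nodeA", "chunk1": "nodeB", "chunk2": "nodeB"}
output_location = {"tile0": "nodeA", "tile1": "nodeC"}
# which output tile(s) each chunk contributes to
contributes_to  = {"chunk0": ["tile0"], "chunk1": ["tile0", "tile1"], "chunk2": ["tile1"]}

def assign_by_input():
    """Each node processes the chunks it already stores, then ships partial
    aggregates to whoever owns the output tile (less raw input moved)."""
    plan = defaultdict(list)
    for chunk, node in input_location.items():
        plan[node].append(chunk)
    return dict(plan)

def assign_by_output():
    """Each node is responsible for its own output tiles and pulls every chunk
    that contributes to them (no partial aggregates to merge, more input moved)."""
    plan = defaultdict(list)
    for chunk, tiles in contributes_to.items():
        for tile in tiles:
            plan[output_location[tile]].append(chunk)
    return dict(plan)

if __name__ == "__main__":
    print("input-driven: ", assign_by_input())   # {'nodeA': ['chunk0'], 'nodeB': ['chunk1', 'chunk2']}
    print("output-driven:", assign_by_output())  # {'nodeA': ['chunk0', 'chunk1'], 'nodeC': ['chunk1', 'chunk2']}
```

The trade-off, as I understood it, is whether you move partial aggregates to the output owners or move raw input to them, and the dynamic strategy chooses per query.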

Jimmy Lin said...

I came across this article in Technology Review, Simpler Programming for Multicore Computers, about stream programming... this may be of interest.

Here's the homepage for Bill Thies, the guy at MIT working on it.

Chris said...

I hate to start my posting on the blog off with a negative comment, but I have to admit that I wasn’t exactly overwhelmed by this talk. The claim was made that these frameworks show that there is “more than one game in town” when it comes to distributing computation. Did anyone dispute this? I felt that what we got were essentially system descriptions of several distributed systems solving a limited set of tasks from a limited domain: geospatial information processing. Granted, there were some general results showing that the kinds of accumulators used in a reduce operation (i.e., where they run) can impact performance, and that different strategies may be optimal in different situations. However, even this section fell flat for me, since I came away without any good take-away message about what the predictive factors for choosing among those strategies are. Nor do I have a sense of the magnitude of the difference these strategies can make. A straw-man example of the kinds of data distributions/algorithms each strategy would be best suited for would have been the high point of the talk.
In short, what I would have liked in a talk that discusses alternative approaches to distributed computing is a common framework for comparing the various approaches. But instead, I felt like what I got was an ad hoc comparison of a couple of specialized distributed systems. Am I likely to build a better distributed system framework because of this talk? No.

Punit Mehta said...

The talk was very captivating, as it presented different approaches to distributed processing, such as the Active Data Repository and DataCutter. I was generally amazed by the DataCutter concept, especially the idea of using filters.
I had a couple of questions, which may just be because I have not understood the concept properly. Anyway, the questions are: In the DataCutter approach, is the data distributed uniformly over the different filters? If it is not uniform, then it seems the buffer size has to be really large. I think there was some discussion of this point, but I would like to understand it in more detail. The second question is whether any work has been done using ADR or DataCutter for image processing. If so, can I get a sample of that work?
Although these are different techniques for distributed processing, the overall framework on which MapReduce, ADR, and DataCutter are based is essentially the same.
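
To make my first question concrete, here is a toy sketch, not the DataCutter API, of filters connected by small fixed-size stream buffers. The point it illustrates is that a full buffer simply blocks the upstream filter (backpressure), so a skewed distribution over filters does not by itself force the buffers to grow.

```python
# Not the DataCutter API -- just a small thread-and-bounded-queue sketch of
# the filter/stream idea. A full queue blocks the producer until the
# downstream filter catches up, so the buffers stay fixed-size.
import queue
import threading

STOP = object()

def producer(out_q):
    for i in range(20):
        out_q.put(i)          # blocks when the bounded buffer is full
    out_q.put(STOP)

def filter_stage(in_q, out_q, fn):
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)
            return
        out_q.put(fn(item))

if __name__ == "__main__":
    q1 = queue.Queue(maxsize=4)   # small, fixed-size stream buffers
    q2 = queue.Queue(maxsize=4)
    threads = [
        threading.Thread(target=producer, args=(q1,)),
        threading.Thread(target=filter_stage, args=(q1, q2, lambda x: x * x)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = q2.get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    print(results)   # squares of 0..19, produced with only 4-slot buffers
```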

Greg J. said...

I thought the systems described represent a kind of interesting subset
of use cases of Map/Reduce-able distributed computation. In cases
where the domain and range of the input are known to a degree,
optimizations exist that don't necessarily generalize to the MR
framework as a whole.

I think as distributed computing becomes more widely used, the uses
for such special, limited scope cases will increase.

I think the point that it's hard to see how any benefit is derived
when you're mapping over all the data, all the time, with little
knowledge about what the output range is going to look like, is a
valid one. And many, maybe most, pure-research use cases fit that
model. But cloud computing has the potential to move into the user
space, and people wanting to take advantage of a robust,
high-performance computing infrastructure to perform less research-y
tasks may be able to benefit from implementation of the ideas
described in the talk.

I think there are a few examples of this happening already.

Alexa offers a web search service that allows you to submit a search
to their search engine against their crawl of the web, and lets you
submit a secondary grep to filter the results, to basically "grep the
web" as they say it. Without having though about it in detail, I'd
think that that kind of process, would be amenable to some of the
optimization processes in the talk.
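
Conceptually, that secondary grep is just a map-only filter over the
crawled pages. A toy sketch (not Alexa's actual service, and not the
Hadoop API; the crawl data here is made up):

```python
# Conceptual only: a "grep the web" pass is a map-only job, i.e. a per-record
# filter with no reduce step, which is exactly the kind of shape that invites
# the locality optimizations from the talk.
import re

def grep_map(doc_id, text, pattern):
    """Emit (doc_id, matching line) pairs; a real job would run this per split."""
    for line in text.splitlines():
        if re.search(pattern, line):
            yield doc_id, line

if __name__ == "__main__":
    crawl = {"page1": "hello world\ndistributed grep", "page2": "nothing here"}
    for doc_id, text in crawl.items():
        for hit in grep_map(doc_id, text, r"grep"):
            print(hit)        # ('page1', 'distributed grep')
```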

Also, PIG, I think, is trying to make distributed computing
more available to the CS masses, basically making Hadoop scriptable.
Again, I think the people doing this kind of scripting might have an
idea of the data they're feeding it, and the results it's outputting.

A final example, I think, is the how-to-do-joins-in-MR problem. In the
general case, if you don't know the m-to-n-ness of the join, the
relative sizes of the tables or their key occurrences, or the expected
output size and distribution, there's not a lot you can do. But odds
are you *do* know something about all that. And being able to provide
hints about it to the framework might, again, make the kinds of
optimizations described feasible.
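
As a hedged illustration of what a size hint could buy you, here's a
plain-Python sketch (not the Hadoop API) that broadcasts the side the
caller claims is small and joins it in memory, falling back to
grouping both sides by key otherwise. All names and data are made up.

```python
# Toy sketch of the hint idea: a size hint lets us pick a broadcast
# (map-side style) join over a group-by-key (reduce-side style) join.
from collections import defaultdict

def broadcast_join(small, large):
    """Map-side style: load the small table into memory once, stream the large one."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, sval, lval) for key, lval in large for sval in lookup.get(key, [])]

def reduce_side_join(left, right):
    """Reduce-side style: group both inputs by key, then pair within each group."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return [(key, l, r) for key, (ls, rs) in groups.items() for l in ls for r in rs]

def join_with_hint(left, right, right_is_small=False):
    return broadcast_join(right, left) if right_is_small else reduce_side_join(left, right)

if __name__ == "__main__":
    users  = [(1, "alice"), (2, "bob")]
    visits = [(1, "page-a"), (1, "page-b"), (2, "page-c")]
    print(join_with_hint(visits, users, right_is_small=True))
```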

Greg

Unknown said...

I would be interested to learn a lot more about DataCutter. I think that the ability to transform the data arbitrarily (instead of being forever tied to the map-reduce cycle) would be a wonderful thing to have, and could be abstracted to build very powerful applications.
He said that when he was working on it you had to manually choose which parts of the code were run in parallel, but I don't see why. I could envision modeling your program as a series of DataCutter modules, running it once to see where the bottlenecks are, and then automatically increasing the number of threads at the bottlenecks. A sophisticated enough system could assign extra threads on the fly when the throughput at one step was too low, which would give you an excellent parallel architecture.
If you could design a programming language or an interface to the DataCutter system with knowledge about what modules could run in parallel, it seems like DataCutter could be used to implement a multi-threaded language that is transparent to the programmer. When writing code you could just mark a method as "parallelizable," and let the system do the rest. As the number of processors on your average computer increases, people are going to be calling out for an easy way to write multi-threaded programs, and DataCutter, much more than MapReduce or ADR, seems to provide this ability. I may be dreaming, but it seems plausible to me at least.
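
As a rough sketch of what I am imagining (this is not DataCutter; the names and thresholds are invented), a stage could watch its input queue and spawn extra worker threads whenever the queue stays backed up:

```python
# Rough sketch of the "add threads at the bottleneck" idea: a stage that
# grows its worker pool when its input queue stays backed up. All thresholds
# are invented for illustration.
import queue
import threading
import time

def run_stage(work_fn, in_q, max_workers=4, check_every=0.1):
    results, workers = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = in_q.get(timeout=0.2)
            except queue.Empty:
                return                      # no more work: let the thread exit
            out = work_fn(item)
            with lock:
                results.append(out)

    def spawn():
        t = threading.Thread(target=worker)
        t.start()
        workers.append(t)

    spawn()                                 # start with a single worker
    while any(t.is_alive() for t in workers):
        time.sleep(check_every)
        # queue still backed up and room to grow? add another worker thread
        if in_q.qsize() > 2 and len(workers) < max_workers:
            spawn()
    for t in workers:
        t.join()
    return results, len(workers)

if __name__ == "__main__":
    q = queue.Queue()
    for i in range(30):
        q.put(i)
    out, n = run_stage(lambda x: (time.sleep(0.02), x * x)[1], q)
    print(f"processed {len(out)} items using {n} worker thread(s)")
```

A real system would presumably measure throughput rather than queue depth, but the idea is the same.
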
ADR seemed to be too specialized, but it did not surprise me to learn that DataCutter is still being worked on. I think it has a lot of potential, and I was thrilled to learn about it.

Alan

Jimmy Lin said...

Alan---you might want to take a look at Dryad, developed at Microsoft Research.

In Dryad, a program is described as a DAG. The framework gives the programmer explicit control over the data flow, which can be dynamically optimized at run time. Sequential processing is performed at nodes, connected via channels---which could be files, TCP pipes, shared memory queues, etc.
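
A toy sketch of that shape, with in-memory lists standing in for the channels (this is not the actual Dryad API, just an illustration of the DAG-of-vertices idea):

```python
# Toy sketch of "program as a DAG of sequential vertices connected by
# channels", with in-memory lists standing in for the files / TCP pipes /
# shared-memory queues a real system would use.
from collections import defaultdict, deque

class Graph:
    def __init__(self):
        self.vertex_fns = {}              # vertex name -> fn(list of input channels) -> output
        self.edges = defaultdict(list)    # upstream vertex -> downstream vertices

    def add_vertex(self, name, fn):
        self.vertex_fns[name] = fn

    def connect(self, src, dst):
        self.edges[src].append(dst)

    def run(self):
        # topological execution: a vertex runs once all of its inputs are ready
        indegree = defaultdict(int)
        inputs = defaultdict(list)        # the "channels"
        for src, dsts in self.edges.items():
            for dst in dsts:
                indegree[dst] += 1
        ready = deque(v for v in self.vertex_fns if indegree[v] == 0)
        outputs = {}
        while ready:
            v = ready.popleft()
            outputs[v] = self.vertex_fns[v](inputs[v])
            for dst in self.edges[v]:
                inputs[dst].append(outputs[v])
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
        return outputs

if __name__ == "__main__":
    g = Graph()
    g.add_vertex("read",   lambda _: list(range(10)))
    g.add_vertex("square", lambda ins: [x * x for x in ins[0]])
    g.add_vertex("total",  lambda ins: sum(ins[0]))
    g.connect("read", "square")
    g.connect("square", "total")
    print(g.run()["total"])               # 285
```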

Unknown said...

I am glad that we're looking at MapReduce in light of other approaches, and it is useful to understand what generalizations and specializations others have already come up with.

It is becoming more apparent to me that a major strength of MapReduce is its elegance and simplicity. The title of the original paper was: Simplified Data Processing on Large Clusters, and I think that remains a goal of the framework. There is something powerful in the idea that through the simple constraint of expressing your problem in terms of a map function and a reduce function, the framework ensures that throwing more hardware at it will be effective. It seems like a lot of arguments over whether one should use something more general or more specialized tend to miss the point that this simple constraint buys you a lot of scalability without a lot of complexity. One would be hard pressed to find a better deal.

That said, I think because of the popularity of MapReduce, more people have now fleshed out the details of more general systems like Dryad, and distributed computing will start to move in that direction, but will now be constrained by the extent to which programmers can understand and reason about the system.

It's telling that distributed systems have been around so long, but have failed to generate the kind of excitement that MapReduce has. If anything, distributed computing frameworks of the future will be forced to compete with MapReduce on its simplicity of programming.
