Dealing With Big Data Outside Of The Cloud: GPU Accelerated Sort



“The demands placed on systems to analyse corpus data increase with input size, and the traditional approaches to processing this data are increasingly having impractical run-times. We show that the use of desktop GPUs presents a significant opportunity to accelerate a number of stages in the normal corpus analysis pipeline. This paper contains our exploratory work and findings into applying high-performance computing technology and methods to the problem of sorting large numbers of concordance lines.”

Published in the Challenges in the Management of Large Corpora workshop, associated with The 9th edition of the Language Resources and Evaluation Conference (2014)

Authors: John Vidler1, Paul Rayson1, Laurence Anthony2, Andrew Scott1, John Mariani1

1: School of Computing and Communications, Lancaster University
2: Faculty of Science and Engineering, Waseda University

Full text: CMLC-2 Proceedings

Presentation at CMLC-2: Slides

The following text is approximately what was delivered at CMLC-2 at LREC’14 in May, 2014. Some changes were made during the actual run through on the day.

Slide 1 (title)

[Paul Speaking…]
As you can see this is joint work. John and I are co-presenting on behalf of our colleagues at Lancaster in the UK along with Laurence Anthony from Waseda in Japan. Laurence, as you might know, is the developer of the AntConc software and is visiting Lancaster on sabbatical this year.

Slide 2 (ToC)

I will start off the presentation by providing you with some motivation for the work, and then hand over to John, who will describe what we’ve done, the data we’ve used, and the results we’ve obtained, before finishing off with a summary of the work.

This research has grown from a combination of two things. First, John’s PhD work on the untapped potential in current operating system architecture designs and second, the work we’re all doing together on improving the design of corpus retrieval software.

Slide 3 (motivation 1)

So, what has led us down this route of dealing with big data outside the cloud?

There are many uses of large corpus data in multiple research areas, as listed here.

In this talk, we’re mainly considering scenarios in digital humanities and corpus linguistics where the end users may be less technically minded than in NLP and text mining.

Slide 4 (motivation – big data)

In Digital Humanities, for example, national and international digitisation initiatives are bringing huge quantities of archive material in image and full text form direct to the historian’s desktop.

In corpus linguistics, the web-as-corpus paradigm and the need for more examples of lower-frequency features encourage researchers to collect ever larger corpora.

However, any search or sort of results from these rich datasets is likely to take minutes, hours, or even days using desktop corpus software such as WordSmith Tools and AntConc. Current software models are basically broken.

Slide 5+ (Motivation – current solutions)

Large infrastructure projects are aiming to provide one type of solution to this problem, but they do not allow …

In corpus linguistics, online tools can support web-based access to large pre-compiled billion-word corpora, however …

More recently, semi-cloud-based systems can provide local access to large corpora; however, they are not easy to install and configure for SSH (social sciences and humanities) researchers.

Slide 6 (Motivation – a remaining need)

So we think there is still a need to consider how to improve the efficiency of locally based non-cloud corpus retrieval software. The kinds of operations required are the standard elements of the corpus linguistics method. For large corpora, we are hitting the limits of desktop power.

Slide 7 (motivation – a case study)

But this is where John’s PhD work comes to the rescue and we wondered whether we can use the local processing power of the underused Graphics Processing Unit. In the case study that we’ll describe, we focussed on the typical sorting task in a concordance tool where multi-level sorting of left and right context is applied. This can be a very lengthy task when many hundreds of thousands of concordance lines are extracted from a large corpus.

Slide 8 (ToC2)

[Pass to John …]

Slide 9 (Hardware – The traditional way)

[John Speaking…]

As Paul has already mentioned, the ‘normal’ way to handle the new, over-large processing and storage requirements for corpus tasks usually involves large numbers of machines, each filled with RAM to handle the load.

Processors available in this kind of environment can number in the hundreds, if you’re really lucky – but normally sit around 100-or-so, with small numbers of them sharing the same memory space.

This dispersion of data is normally handled by middleware frameworks connecting the fragmented copies together, requiring a not-insignificant overhead to resolve into final results.

Slide 10 (Hardware – The not-so-traditional way)

So, to reduce this overhead, we asked, “Can we squash the hardware requirements for this kind of processing down into a GPU?”.

Many scientific fields are already exploring the benefits of using the compute power found in graphics hardware to accelerate their processing, and we thought that there should be a way to apply this approach to corpus linguistics.

Slide 11 (Card Comparison)

I should mention that none of us are sponsored by nVidia…

Although if any of you have contacts…

I thought I had better include this slide, as often when I’ve been speaking to researchers there seems to be the idea that the hardware is prohibitively expensive for experimentation.

On the left, we have a “commodity” GPU, more suited to actual graphics work, and one I picked up by simply walking into a computer superstore (PC World, as it happens) and buying it over the counter for a mere £30.

While it certainly isn’t going to be the best you can get, it is capable of running the tests I will be describing shortly.

Highlighted is the card we used for our experiments, the GTX Titan, which at the time was the cream of the crop, barring the prohibitively expensive (and then only just released) specialist GPGPU “K20” card.

On the right, I’ve included the newly released K40.

Now those of you familiar with HPC deployments will probably be thinking that the 192 cores on the Titan look a bit low.

One thing to remember is that these are not “normal” processors.

Each of the card’s 14 multiprocessors (192 cores apiece) can keep 2048 threads in flight at once, giving us a whopping 28,672 truly concurrent threads to play with.

Those of you quick with mental arithmetic will have already worked out that the K40, with 15 multiprocessors, therefore gives us 30,720 concurrent threads!
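If you want to check these numbers for whatever card is in your own machine, the CUDA runtime reports both the multiprocessor count and the maximum resident threads per multiprocessor; multiplying the two gives the number of threads the device can keep in flight. A minimal query sketch, assuming a single CUDA-capable card as device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("%s: %d multiprocessors, %d concurrently resident threads\n",
           prop.name, prop.multiProcessorCount, residentThreads);
    return 0;
}
```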

A brief note on scalability…

Slide 12 (Hardware – Scalability)

… it is possible to run several cards at once, up to three in various configurations.

In the image shown, for example, we have three cards all working in tandem, giving us three times those resources to play with – 86,016 concurrent threads – and 36 GB of RAM.
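Nothing in CUDA spreads the work across cards for you, but because batches of concordance lines are independent, the host code can simply hand each batch to the next device in turn. A rough, hypothetical sketch of that round-robin dispatch (the function and variable names are ours, not from the paper):

```cuda
#include <cuda_runtime.h>

// Hypothetical host-side sketch: farm independent batches of sort keys out to
// every card in the machine, round-robin. Each batch is copied to its device,
// sorted there, and copied back; batches never need to talk to one another.
void sortBatchesOnAllDevices(int **hostBatches, const int *batchSizes, int batchCount) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) return;

    for (int b = 0; b < batchCount; ++b) {
        cudaSetDevice(b % deviceCount);                 // pick the next card
        size_t bytes = (size_t)batchSizes[b] * sizeof(int);

        int *d_keys = nullptr;
        cudaMalloc(&d_keys, bytes);
        cudaMemcpy(d_keys, hostBatches[b], bytes, cudaMemcpyHostToDevice);

        // ... launch the sort kernel on this device's batch here ...

        cudaMemcpy(hostBatches[b], d_keys, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_keys);
    }
}
```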

Slide 13 (ToC3)

Now that I’ve described the environment our tests were running in, a short note on the data we used.

Slide 14+ (Data Sources)

Because we needed a fairly simple corpus to work with for these experiments – as the actual linguistic results are incidental – we chose to build a quick and dirty plaintext corpus from the Project Gutenberg copyright-free library.

On the chance that any of you want to build your own for similar purposes, the steps we followed to create our input datasets are as follows.

First, grab the DVD download and extract the text-only format books.

Next, we ran a Java tool over the collected text files, which generated collocations for specific terms – in this case, we deliberately chose words that would result in huge input sets: “the”, “and”, “there”, and so forth.

All words were treated as lowercase, again to keep the input sets large!
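For anyone curious what that extraction step boils down to: the tool we used was written in Java, but a rough (and hypothetical) C++ equivalent of pulling lowercased, fixed-width context windows for a search term out of a plain-text file – ignoring details such as punctuation stripping – looks something like this:

```cuda
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Collect every occurrence of `term` in the file at `path`, with `window`
// words of context on either side, with all words lowercased.
std::vector<std::string> extractConcordanceLines(const std::string &path,
                                                 const std::string &term,
                                                 std::size_t window = 5) {
    std::ifstream in(path);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) {
        std::transform(w.begin(), w.end(), w.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        words.push_back(w);
    }

    std::vector<std::string> lines;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] != term) continue;
        std::ostringstream line;
        std::size_t from = (i < window) ? 0 : i - window;
        std::size_t to   = std::min(words.size() - 1, i + window);
        for (std::size_t j = from; j <= to; ++j)
            line << words[j] << (j < to ? " " : "");
        lines.push_back(line.str());
    }
    return lines;
}
```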

Slide 15 (Data Sources – Example Input)

[Simple explanation of slide]

Slide 16 (Toc4)

Armed with this data, we approached the GPU, and quickly found that the usual algorithms would not work in this case; quicksort and mergesort, for example, while highly parallelizable on traditional equipment, become very difficult to manage on a GPU.

The architecture simply is not designed to be run like a general purpose CPU, and requires batch operations to operate effectively.

Rather than elaborate dynamic code, GPGPU computing is much better suited to static code run with very high thread densities, allowing any cases where the processing is inappropriate to simply fail.

This is likely why other disciplines have had so much success with this kind of hardware, as the processing on their data tends to follow simple rules, but must be executed many millions of times. Simulation of physical bodies, for example.

To this end, we eventually settled on one of the older, simpler sorting algorithms: the swap sort.

The simplicity of this algorithm, paired with the sheer number of processing elements available on the card, proved to be very effective.
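For the programmers in the audience, the paper does not reproduce the kernel itself, but one simple reading of a “swap sort” is an odd-even transposition sort: every thread owns one compare-and-swap of a neighbouring pair, and the host launches one pass per phase. A minimal sketch over plain integer keys (the names are ours; sorting real concordance lines would need a comparison over the context words rather than a single integer compare):

```cuda
#include <cuda_runtime.h>

// One phase of an odd-even transposition ("swap") sort: each thread compares
// one neighbouring pair and swaps it if it is out of order. Even phases start
// at index 0, odd phases at index 1.
__global__ void oddEvenPhase(int *keys, int n, int phase) {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + (phase & 1);
    if (i + 1 < n && keys[i] > keys[i + 1]) {
        int tmp     = keys[i];
        keys[i]     = keys[i + 1];
        keys[i + 1] = tmp;
    }
}

// n phases are enough to fully sort n keys; each launch is one parallel pass.
void gpuSwapSort(int *d_keys, int n) {
    int pairs   = n / 2 + 1;
    int threads = 256;
    int blocks  = (pairs + threads - 1) / threads;
    for (int phase = 0; phase < n; ++phase)
        oddEvenPhase<<<blocks, threads>>>(d_keys, n, phase);
    cudaDeviceSynchronize();
}
```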

Slide 17+ (Results – Running on the GPU)

As we can see here. Quite dramatically different.

The vertical line is the GPU throughput in concordance lines per second.

But the most interesting part, for me, is the first portion of the graph [SLIDE]

In this first section, we can see that the GPU actually under-performs the CPU for low numbers of concordance lines!
Then, suddenly, beyond 2000 lines, the performance shoots up to several orders of magnitude better.

As it turned out, looking at the specification for the card, we discovered that the internal overheads involved with copying memory through the multi-layered cache inside the GPU meant that for small inputs, we actually spent more time waiting for data than processing it!

However, beyond this threshold, the performance benefits become, hopefully, plainly obvious.
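If you want to see this crossover on your own card, the CUDA event API makes it easy to time the host-to-device copy and the sort separately. A rough sketch, reusing the gpuSwapSort function from the earlier sketch (again, our names rather than anything from the paper):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

void gpuSwapSort(int *d_keys, int n);   // from the earlier sort sketch

// Time the host-to-device copy and the sort separately with CUDA events,
// to see which of the two dominates for a given input size.
void profileSort(const int *h_keys, int n) {
    int *d_keys = nullptr;
    cudaMalloc(&d_keys, n * sizeof(int));

    cudaEvent_t start, copied, done;
    cudaEventCreate(&start);
    cudaEventCreate(&copied);
    cudaEventCreate(&done);

    cudaEventRecord(start);
    cudaMemcpy(d_keys, h_keys, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(copied);
    gpuSwapSort(d_keys, n);              // runs the sort passes on the device
    cudaEventRecord(done);
    cudaEventSynchronize(done);

    float copyMs = 0.0f, sortMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, copied);
    cudaEventElapsedTime(&sortMs, copied, done);
    printf("n=%d  copy=%.3f ms  sort=%.3f ms\n", n, copyMs, sortMs);

    cudaEventDestroy(start);
    cudaEventDestroy(copied);
    cudaEventDestroy(done);
    cudaFree(d_keys);
}
```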

Slide 18 (Toc5)

So, what did we find, in summary?

Slide 19 (Summary)

We can use GPGPU computing to perform these tasks; however, the design of the software is fundamentally different from that traditionally used.

But

The processing has to be carefully tailored to the GPU; the processors available in these devices are not ‘normal’ by the standards of traditional computing, but when we do choose algorithms that match the architecture well, the benefits can be large.

Longer-running, more complex processing works far better on the GPU, so long as the individual tasks can run independently of one another.

Our sorting use-case is actually really too simple; the GPU could be utilized to do much more in similar time scales. It is this that our research will focus on moving forward, and we hope to bring more complex corpus processing tasks to the desktop GPU soon.

And with that, all that remains is to say…

Slide 20 (Questions)

… thank you for your time and attention, and does anyone have any questions?

Slide 21 (References)
