BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160728Z
LOCATION:D166
DTSTART;TZID=America/Chicago:20181112T113000
DTEND;TZID=America/Chicago:20181112T120000
UID:submissions.supercomputing.org_SC18_sess173_ws_espm102@linklings.com
SUMMARY:Integration of CUDA Processing within the C++ Library for Parallel
 ism and Concurrency (HPX)
DESCRIPTION:Workshop\nAccelerators, Exascale, Parallel Programming Languag
 es, Libraries, and Models, Workshop Reg Pass\n\nIntegration of CUDA Proces
 sing within the C++ Library for Parallelism and Concurrency (HPX)\n\nDiehl
 , Kaiser, Heller, Seshadri\n\nExperience shows that on today's high perfor
 mance systems, the utilization of different acceleration cards in conjunct
 ion with a high utilization of all other parts of the system is difficult.
  Future architectures, like exascale clusters, are expected to aggravate t
 his issue as the number of cores is expected to increase and memory hiera
 rchies are expected to become deeper. A key challenge for distributed appl
 ications is to guarantee high utilization of all available resources, incl
 uding local or remote acceleration cards on a cluster, while fully using
  all available CPU resources and integrating the GPU work into the overall
  programming model.\n\nFor the integration of CUDA code we extended
  HPX and enabled asynchronous data transfers to and from the GPU device
  and the asynchronous invocation of CUDA kernels on this data. Both
  operations are well integrated into the general programming model of HPX,
  which allows any GPU operation to overlap seamlessly with work on the
  main cores. Any user-defined CUDA kernel can be launched.\n\nWe present
  asynchronous implementations of the data transfers and kernel launches
  for CUDA code as part of an HPX asynchronous execution graph. Using this
  approach we can combine all remotely and locally available acceleration
  cards on a cluster to
  utilize its full performance capabilities. Overhead measurements show
  that the integration of the asynchronous operations as part of the HPX
  execution graph imposes no additional computational overhead and
  significantly
  eases the orchestration of coordinated and concurrent work on the main
  cores and the GPU devices in use.
URL:https://sc18.supercomputing.org/presentation/?id=ws_espm102&sess=sess1
 73
END:VEVENT
END:VCALENDAR