BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160904Z
LOCATION:C2/3/4 Ballroom
DTSTART;TZID=America/Chicago:20181115T083000
DTEND;TZID=America/Chicago:20181115T170000
UID:submissions.supercomputing.org_SC18_sess324_post155@linklings.com
SUMMARY:Tensorfolding: Improving Convolutional Neural Network Performance 
 with Fused Microkernels
DESCRIPTION:Poster\nTech Program Reg Pass, Exhibits Reg Pass\n\nTensorfold
 ing: Improving Convolutional Neural Network Performance with Fused Microke
 rnels\n\nAnderson, Georganas, Avancha, Heinecke\n\nConvolution layers are 
 prevalent in many classes of deep neural networks, including Convolutional
  Neural Networks (CNNs) which provide state-of-the-art results for tasks l
 ike image recognition, neural machine translation and speech recognition. 
 In the recent past, several techniques to improve the generalization
  capabilities of neural networks have been developed; the most prominent
  and successful is batch normalization. In deep neural network training,
  the batch normalization layer consists of a memory-bandwidth-bound
  kernel. On the latest Intel Skylake-based Xeon processors, a significant
  portion of execution time is spent in this kernel. By leveraging the
  CPU's large caches and its latency-optimized execution model, we are
  able to reduce this kernel's time to a bare minimum while improving
  forward pass layer runtimes by 21% compared to an unfused implementation
  and by 2% compared to a fused implementation.
URL:https://sc18.supercomputing.org/presentation/?id=post155&sess=sess324
END:VEVENT
END:VCALENDAR

