BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160728Z
LOCATION:D167/174
DTSTART;TZID=America/Chicago:20181112T140000
DTEND;TZID=America/Chicago:20181112T143000
UID:submissions.supercomputing.org_SC18_sess151_ws_mlhpce117@linklings.com
SUMMARY:On Adam-Trained Models and a Parallel Method to Improve the Genera
 lization Performance
DESCRIPTION:Workshop\nDeep Learning, Machine Learning, Workshop Reg Pass\n
 \nOn Adam-Trained Models and a Parallel Method to Improve the Generalizati
 on Performance\n\nCong, Buratti\n\nAdam is a popular stochastic optimizer 
 that uses adaptive estimates of lower-order moments to update weights and 
 requires little hyper-parameter tuning. Some recent studies have called th
 e generalization and out-of-sample behavior of such adaptive gradient meth
 ods into question, and argued that such methods are of only marginal value
 . Notably for many of the well-known image classification tasks such as CI
 FAR-10 and ImageNet-1K, current models with best validation performance ar
 e still trained with SGD with a manual schedule of learning rate reduction
 .\n\nWe analyze Adam and SGD trained models for 7 popular neural network
  architectures for image classification tasks using the CIFAR-10 dataset. 
  Visualization shows that for classification Adam trained models frequentl
 y "focus" on areas of the images not occupied by the objects to be class
 ified.  Weight statistics reveal that Adam trained models have larger weig
 hts and L2 norms than SGD trained ones. Our experiments show that weight d
 ecay and reducing the initial learning rates improve generalization perfo
 rmance of Adam, but there still remains a gap between Adam and SGD trained
  models.\n\nTo bridge the generalization gap, we adopt a K-step model ave
 raging parallel algorithm with the Adam optimizer. With very sparse commun
 ication, the algorithm achieves high parallel efficiency. For the 7 models
  on average the improvement in validation accuracy over SGD is 0.72%, and 
 the average parallel speedup is 2.5 times with 6 GPUs.
URL:https://sc18.supercomputing.org/presentation/?id=ws_mlhpce117&sess=ses
 s151
END:VEVENT
END:VCALENDAR