SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Guifeng Deng 1,3, Pan Wang 2, Jiquan Wang 4, Shuying Rao 1,3, Junyi Xie 1, Wanjun Guo 1,4,5, Tao Li 1,4,5 ✉, Haiteng Jiang 1,2,4,5 ✉

1 Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Brain Science and Brain Medicine, and Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China.
2 Department of Psychiatry and Mental Health, Wenzhou Medical University, Wenzhou 325035, Zhejiang Province, China.
3 College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou 310058, China.
4 MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, 311121, China.
5 Zhejiang Key Laboratory of Clinical and Basic Research for Psychiatric Diseases, Hangzhou 310058, China
✉ email: litaozjusc@zju.edu.cn; h.jiang@zju.edu.cn

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) designed to stage sleep from multi-channel polysomnography (PSG) waveform images while generating clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen’s kappa scores of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance.
Expert evaluations further validated the quality of the model’s reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.

Introduction

Sleep disorders represent a major global public health challenge. Insufficient sleep is prevalent across all age groups and has been recognized as an underreported epidemic with far-reaching consequences for cardiovascular, metabolic, and cognitive health [1]. Obstructive sleep apnea alone is estimated to affect nearly one billion adults aged 30–69 years worldwide [2]. Polysomnography (PSG), the simultaneous recording of electroencephalography (EEG), electrooculography (EOG), and electromyography (EMG) during sleep, remains the clinical gold standard for diagnosing sleep-disordered breathing, narcolepsy, parasomnias, and other conditions [3]. Under current clinical practice, trained sleep technologists visually inspect multi-channel PSG recordings and classify each 30-second epoch into one of five stages: Wakefulness (W), non-rapid eye movement stages N1, N2, and N3, and rapid eye movement sleep (R). Staging follows the rules codified in the American Academy of Sleep Medicine (AASM) scoring manual.

Over the past decade, deep learning has yielded substantial gains in automatic sleep staging.
Convolutional and recurrent architectures such as DeepSleepNet [8] and TinySleepNet [9] introduced end-to-end sequence modeling from raw signals; attention-based methods including AttnSleep [10] and SleepTransformer [6] leveraged self-attention to capture temporal dependencies; and fully convolutional approaches like U-Sleep [11] and U-Time [12] enabled scalable multi-channel scoring. On large-scale benchmarks, several of these methods now approach or match human inter-rater agreement levels [13,14]. However, all such models operate as black-box classifiers: they output a predicted sleep stage label without any explanation of the decision process. Clinicians cannot determine which waveform features drove a particular classification, whether the model’s reasoning aligns with AASM rules, or why a given epoch was scored as it was. This opacity undermines clinical trust and is widely recognized as a major obstacle to translating AI sleep staging into routine practice [6,15,16]. Muto and Berthomier [17] emphasized that automated sleep scoring should complement rather than replace visual expert assessment, and called for hybrid solutions in which explainable automated tools remain under expert oversight.

Recognizing this gap, researchers have pursued multiple strategies to endow sleep staging models with interpretability. Post-hoc attention visualization, as in SleepTransformer [6], highlights which temporal segments receive the highest attention weights. Prototype-based learning, exemplified by WaveSleepNet [18], maps model decisions to learned waveform prototypes resembling canonical AASM features. Gradient attribution methods such as Grad-CAM [19,20] and layer-wise relevance propagation (LRP) [21] project model decisions onto input regions.
Architectural approaches inject domain knowledge through parameterized Gabor kernels [22] or project deep embeddings into rule-defined feature spaces [23]. Despite their contributions, these methods share a fundamental limitation: they indicate where in the input the model attends or which features it deems salient, but they do not articulate why the model arrived at a specific staging decision. The resulting explanations (heatmaps, saliency maps, prototype similarity scores) still require expert reinterpretation and cannot be directly read as clinical reasoning. Bienefeld et al. [7] showed through a longitudinal multi-method study that clinicians do not seek model-centric constructs such as Shapley values; rather, they need “clinical plausibility”, that is, explanations framed in the same diagnostic language and rule systems they use themselves. This disconnect between what current explainable AI (XAI) methods provide and what clinicians actually need remains a central challenge [7,15].

Vision-language models (VLMs), which jointly process visual inputs and generate free-form natural language, offer a fundamentally different approach. In medical imaging, VLMs have enabled automated diagnostic report generation from retinal photographs [24] and fundus fluorescein angiograms [25,26], interactive natural-language-driven analysis of pathology slides [27], and multimodal representation learning that unifies histopathology images with clinical text [28] or echocardiogram video with expert clinical reports [29]. Collectively, these results demonstrate that visual perception and language reasoning can be jointly leveraged for clinical decision support across diverse specialties. Critically, however, strong benchmark performance does not guarantee reliable reasoning: Jin et al.
[30] found that among cases where GPT-4V selected the correct answer on medical image questions, 35.5% nonetheless contained flawed rationales, with image comprehension errors being the most prevalent failure mode (27.2%). This finding underscores two requirements for deploying VLMs in specialized clinical tasks: domain-specific training to sharpen visual perception, and an independent evaluation of reasoning quality that goes beyond classification accuracy [24,31].

Fig. 1 | Overview of SleepVLM. The figure is organized into three stages: PSG signal processing (left), a two-phase training pipeline (center), and explainable output with dual-dimension evaluation (right). Left: Multi-channel PSG signals are acquired from a clinical polysomnography recording using six channels (F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG). Raw signals are band-pass filtered, resampled to 100 Hz, segmented into 30-second epochs, and rendered as standardized waveform images (448 × 224 pixels). Center: In Phase 1 (waveform-perceptual pre-training), the model receives a single epoch image and is trained to predict per-second spectral band power (δ, θ, α, β) and mean absolute amplitude (MAV) for each channel, with both the vision encoder and language model unfrozen. In Phase 2 (rule-grounded supervised fine-tuning), three consecutive epoch images (preceding, current, subsequent) are presented alongside AASM scoring rules injected into the system prompt. The vision encoder is frozen while the language model is fine-tuned via low-rank adaptation (LoRA) to generate a sleep stage, applicable rule identifiers, and a structured rationale.
Right: Unlike conventional black-box models that output only a stage label, SleepVLM produces an auditable, rule-cited output comprising the predicted stage, applicable AASM rules, and natural-language reasoning. Classification performance and expert-rated reasoning quality are evaluated on a held-out test set (MASS-SS1) and an external clinical test set (ZUAMHCS). Bottom: Dataset allocation across training phases and evaluation: MASS-SS2/4/5 for Phase 1, MASS-SS3 for Phase 2, MASS-SS1 for held-out testing, and ZUAMHCS for external testing.

Here we propose SleepVLM, a framework that addresses both requirements in the context of sleep staging. The core design achieves a dual objective. Explainable: the VLM generates natural-language reasoning that describes observed waveform features and staging logic for every epoch, an intrinsic capability of the vision-language architecture. Rule-Grounded: a structured set of AASM scoring rules is explicitly injected into the system prompt during training, and the model is required to cite specific rule identifiers in its output, making every staging decision auditable against the clinical standard. By rendering multi-channel PSG signals as waveform images and presenting consecutive epochs to a fine-tuned Qwen2.5-VL-3B-Instruct model [32], SleepVLM emulates the technologist’s workflow of visually inspecting multi-channel traces, identifying characteristic waveform features, applying AASM rules, and reaching a staging decision, while producing a complete written record of the reasoning at every step.

The main contributions of this work are as follows:

• SleepVLM framework. We present, to our knowledge, the first application of a vision-language model to explainable sleep staging.
Through a two-phase training pipeline (waveform-perceptual pre-training followed by rule-grounded supervised fine-tuning with a mixture of fine-grained and coarse annotations), a 3-billion-parameter model learns to generate a structured rationale that cites AASM rules, unifying classification and explanation in a single forward pass.

• Dual-dimension evaluation on held-out and external test sets. We evaluate SleepVLM on a held-out test set (MASS-SS1, n = 53) and an external clinical dataset (ZUAMHCS, n = 100) along two complementary dimensions: classification performance comparable to state-of-the-art signal-based methods (κ = 0.767 and 0.743, respectively) and expert-rated reasoning quality exceeding 4.0/5.0 on all three evaluation dimensions (Factual Accuracy, Evidence Comprehensiveness, and Logical Coherence). In addition, 4-bit weight quantization [33] reduces model size by 55% and accelerates inference by 2.2× with ≤ 1.6 percentage-point loss in κ, supporting deployment on a single consumer-grade GPU.

• Open-source dataset. We release MASS-EX, an expert-annotated dataset of 62 subjects and 59,317 epochs with AASM rule labels and, for a subset, full expert-written rationales, providing a public benchmark for future research on interpretable sleep staging.

Results

Classification performance on held-out and external test sets

SleepVLM was benchmarked against 12 signal-based and 2 image-based methods on both the held-out test set (MASS-SS1, n = 53) and the external clinical test set (ZUAMHCS, n = 100), with all baselines retrained on the same data splits and channel configuration (Table 1). On MASS-SS1, SleepVLM achieved an accuracy of 0.835 [95% CI 0.824, 0.846], macro-F1 of 0.793 [0.777, 0.806], and Cohen’s kappa of 0.767 [0.750, 0.782], placing it in the same performance tier as the leading signal-based method LPSGM (κ = 0.763) and the image-based SleepXViT (κ = 0.771), with confidence intervals overlapping across all three methods. On the external clinical test set ZUAMHCS, SleepVLM attained κ = 0.743 [0.721, 0.763], ranking second only to LPSGM (κ = 0.750) and surpassing all other baselines including SeqSleepNet (κ = 0.737), RobustSleepNet (κ = 0.725), SleepDG (κ = 0.719), and SleepXViT (κ = 0.694). The cross-domain robustness of SleepVLM is noteworthy: whereas SleepXViT, which achieved the highest accuracy on MASS-SS1 (0.838), dropped substantially on ZUAMHCS (κ = 0.694), SleepVLM maintained a kappa reduction of only 2.4 percentage points. Critically, SleepVLM is the only method among all comparisons that simultaneously provides interpretable, rule-grounded output, achieving comparable classification performance while additionally generating structured natural-language explanations citing AASM rules for every staged epoch.

Subject-level distributions of accuracy, macro-F1, and kappa (Fig. 2a) revealed comparable spread and central tendency across MASS-SS1 and ZUAMHCS, indicating stable per-subject performance and robust cross-domain generalization. Normalized confusion matrices (Fig. 2b) showed that on MASS-SS1, W (0.92), N2 (0.91), and R (0.89) attained high recall, N3 was somewhat lower (0.79), and N1 was the weakest stage (0.45), with primary confusions directed toward N2 (0.34) and W (0.12). On ZUAMHCS, W (0.86) and N2 (0.86) remained well classified, N3 improved to 0.85, N1 recall rose to 0.54 but remained the most error-prone stage, and R recall decreased to 0.73.
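The per-stage recalls quoted above are the diagonal of a row-normalized confusion matrix, and Cohen’s kappa can be computed from the same table of counts. A minimal sketch of both calculations follows. The counts are hypothetical: 100 epochs per stage, with the diagonal chosen to mirror the MASS-SS1 recalls in the text and the off-diagonal entries filled in for illustration. The true class balance is not given in this section, so the resulting kappa is illustrative and is not the reported 0.767. All function and variable names are ours.

```python
STAGES = ["W", "N1", "N2", "N3", "R"]


def row_normalize(counts):
    """Divide each row by its total so the diagonal holds per-stage recall."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def cohen_kappa(counts):
    """Cohen's kappa from a square matrix (rows: truth, columns: prediction)."""
    k = len(counts)
    n = sum(sum(row) for row in counts)
    observed = sum(counts[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts (assumption): 100 epochs per ground-truth stage.
toy_counts = [
    [92, 4, 2, 0, 2],    # W
    [12, 45, 34, 1, 8],  # N1
    [0, 5, 91, 3, 1],    # N2
    [0, 0, 21, 79, 0],   # N3
    [1, 4, 6, 0, 89],    # R
]

normalized = row_normalize(toy_counts)
recall = {stage: normalized[i][i] for i, stage in enumerate(STAGES)}
kappa = cohen_kappa(toy_counts)
print(recall, round(kappa, 3))
```

Because kappa depends on the column marginals as well as the diagonal, two cohorts with identical per-stage recall can still yield different kappa values when their stage proportions differ.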
The persistent difficulty with N1 reflects the inherently ambiguous electrophysiological characteristics of this transitional stage (human inter-rater agreement for N1 is approximately 63% [5]) and is a recognized limitation shared by all automated sleep staging methods.

Fig. 2 | Classification performance and expert evaluation of SleepVLM. a, Subject-level distributions of accuracy, macro-F1, and Cohen’s kappa on the held-out test set (MASS-SS1, n = 53 subjects) and the external clinical test set (ZUAMHCS, n = 100 subjects). Violin plots are overlaid with box plots (median, interquartile range, whiskers at 1.5× IQR) and individual subject scores (jittered points). b, Row-normalized confusion matrices on MASS-SS1 (top) and ZUAMHCS (bottom). Each cell shows the proportion of epochs with a given ground-truth label (row) predicted as each stage (column). c, Distribution of expert ratings for SleepVLM-generated rationales on MASS-SS1 (top row, 530 stratified samples) and ZUAMHCS (bottom row, 1,000 stratified samples), evaluated on three dimensions: Factual Accuracy, Evidence Comprehensiveness, and Logical Coherence (each scored on an integer scale of 0–5). Dashed red lines and annotations indicate mean scores. The complete rating scale is defined in Supplementary Table 1.

Fig. 2b values, MASS-SS1 (rows: ground truth; columns: prediction):

| | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|
| W | 0.92 | 0.04 | 0.02 | 0.00 | 0.02 |
| N1 | 0.12 | 0.45 | 0.34 | 0.01 | 0.08 |
| N2 | 0.00 | 0.05 | 0.91 | 0.03 | 0.01 |
| N3 | 0.00 | 0.00 | 0.21 | 0.79 | 0.00 |
| R | 0.01 | 0.04 | 0.06 | 0.00 | 0.89 |

Fig. 2b values, ZUAMHCS (rows: ground truth; columns: prediction):

| | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|
| W | 0.86 | 0.07 | 0.02 | 0.01 | 0.04 |
| N1 | 0.16 | 0.54 | 0.25 | 0.01 | 0.05 |
| N2 | 0.02 | 0.05 | 0.86 | 0.06 | 0.01 |
| N3 | 0.00 | 0.00 | 0.15 | 0.85 | 0.00 |
| R | 0.05 | 0.11 | 0.10 | 0.00 | 0.73 |

Table 1 | Performance comparison of SleepVLM with other baselines on held-out and external test sets

| Method | Input modality | MASS-SS1 Accuracy | MASS-SS1 Macro-F1 | MASS-SS1 Kappa | ZUAMHCS Accuracy | ZUAMHCS Macro-F1 | ZUAMHCS Kappa |
|---|---|---|---|---|---|---|---|
| AttnSleep [10] | Signal | 0.801 [0.779, 0.821] | 0.756 [0.731, 0.777] | 0.723 [0.693, 0.750] | 0.752 [0.731, 0.771] | 0.701 [0.679, 0.722] | 0.666 [0.640, 0.690] |
| DeepSleepNet [8] | Signal | 0.803 [0.783, 0.821] | 0.751 [0.724, 0.773] | 0.725 [0.696, 0.750] | 0.749 [0.728, 0.768] | 0.701 [0.676, 0.722] | 0.658 [0.630, 0.683] |
| LPSGM [34] | Signal | 0.831 [0.817, 0.843] | 0.794 [0.777, 0.809] | 0.763 [0.745, 0.781] | 0.818 [0.797, 0.833] | 0.775 [0.752, 0.793] | 0.750 [0.722, 0.771] |
| ResnetMHA [35] | Signal | 0.784 [0.758, 0.807] | 0.761 [0.729, 0.787] | 0.709 [0.675, 0.739] | 0.751 [0.729, 0.772] | 0.712 [0.686, 0.735] | 0.667 [0.637, 0.694] |
| RobustSleepNet [36] | Signal | 0.815 [0.800, 0.829] | 0.755 [0.733, 0.774] | 0.738 [0.718, 0.758] | 0.799 [0.783, 0.813] | 0.757 [0.741, 0.771] | 0.725 [0.704, 0.743] |
| SalientSleepNet [37] | Signal | 0.787 [0.770, 0.804] | 0.722 [0.701, 0.741] | 0.699 [0.675, 0.721] | 0.778 [0.758, 0.795] | 0.732 [0.710, 0.750] | 0.693 [0.668, 0.716] |
| SeqSleepNet [38] | Signal | 0.799 [0.782, 0.815] | 0.742 [0.720, 0.761] | 0.712 [0.684, 0.736] | 0.812 [0.793, 0.827] | 0.757 [0.739, 0.773] | 0.737 [0.713, 0.758] |
| SleepDG [39] | Signal | 0.812 [0.794, 0.827] | 0.762 [0.742, 0.779] | 0.735 [0.709, 0.756] | 0.798 [0.777, 0.816] | 0.749 [0.725, 0.768] | 0.719 [0.690, 0.743] |
| TinySleepNet [9] | Signal | 0.829 [0.811, 0.845] | 0.779 [0.758, 0.799] | 0.760 [0.735, 0.782] | 0.786 [0.764, 0.806] | 0.741 [0.718, 0.763] | 0.712 [0.684, 0.738] |
| U-Sleep [11] | Signal | 0.812 [0.795, 0.827] | 0.749 [0.727, 0.766] | 0.732 [0.708, 0.753] | 0.781 [0.764, 0.797] | 0.720 [0.702, 0.737] | 0.697 [0.674, 0.718] |
| U-Time [12] | Signal | 0.770 [0.748, 0.790] | 0.703 [0.676, 0.726] | 0.675 [0.643, 0.703] | 0.776 [0.756, 0.793] | 0.715 [0.693, 0.734] | 0.691 [0.665, 0.714] |
| XSleepNet [40] | Signal | 0.780 [0.756, 0.802] | 0.711 [0.684, 0.737] | 0.689 [0.656, 0.718] | 0.727 [0.702, 0.751] | 0.661 [0.637, 0.686] | 0.634 [0.601, 0.665] |
| SleepXViT [21] | Image | 0.838 [0.825, 0.850] | 0.793 [0.775, 0.809] | 0.771 [0.752, 0.788] | 0.776 [0.757, 0.795] | 0.724 [0.705, 0.745] | 0.694 [0.668, 0.719] |
| ResNet-18 [41] | Image | 0.830 [0.815, 0.843] | 0.778 [0.760, 0.793] | 0.758 [0.737, 0.778] | 0.793 [0.771, 0.812] | 0.747 [0.723, 0.768] | 0.717 [0.687, 0.742] |
| SleepVLM (ours) | Image | 0.835 [0.824, 0.846] | 0.793 [0.777, 0.806] | 0.767 [0.750, 0.782] | 0.812 [0.796, 0.827] | 0.766 [0.749, 0.782] | 0.743 [0.721, 0.763] |

All baseline methods were reimplemented from their publicly released code and trained on the same data splits and channel configuration used in this work. The best result in each column is shown in bold. Classification performance values are point estimates with 95% confidence intervals [lower, upper] derived from 1,000 subject-level cluster bootstrap resamples.

Expert evaluation of reasoning quality

To evaluate the clinical quality of SleepVLM’s rationale beyond classification accuracy, a trained sleep technologist rated model-generated rationales on three dimensions (Factual Accuracy & Perceptual Fidelity, Diagnostic Evidence Comprehensiveness, and Logical Coherence & Guideline Concordance), each scored on an integer scale of 0–5 (Supplementary Table 1).
Evaluation samples were constructed by drawing 10 stratified-random epochs per subject from each test set (530 epochs from MASS-SS1 and 1,000 epochs from ZUAMHCS), ensuring balanced representation across all five sleep stages. SleepVLM achieved mean scores of 4.24, 4.14, and 4.15 on MASS-SS1 (composite 12.53/15) and 4.09, 4.01, and 4.04 on ZUAMHCS (composite 12.14/15) for the three dimensions, respectively (Fig. 2c). All six dimension–dataset combinations exceeded 4.0 (“Good” level), indicating that the model’s reasoning consistently describes waveform features with high fidelity, marshals multi-channel evidence systematically, and applies AASM rules in a logically coherent manner. The cross-domain decline was modest (Δ composite = −0.39), suggesting that reasoning quality generalizes alongside classification performance.

The rating distributions (Fig. 2c) further illuminated the consistency of reasoning quality. On MASS-SS1, scores of 4–5 accounted for 87.7% (Factual Accuracy), 88.9% (Evidence Comprehensiveness), and 88.7% (Logical Coherence) of the evaluated samples. On ZUAMHCS, these proportions were 83.0%, 84.2%, and 81.9%, respectively. Low-quality outputs (scores 0–1) were infrequent on MASS-SS1, ranging from 2.6% (Factual Accuracy) to 6.2% (Logical Coherence), but somewhat more frequent on ZUAMHCS, where a Factual Accuracy score of 1 accounted for 11.0% (110/1,000) of samples, likely reflecting the greater waveform complexity and signal-quality variability present in an independent clinical recording environment.

Qualitative examination of individual model outputs (Fig. 3) illustrated how SleepVLM emulates the structured workflow of a sleep technologist.
In three correctly staged examples, the model identified stage-defining waveform features: alpha rhythm exceeding 50% of the epoch and conjugate eye blinks for W (Rules W.1, W.2); a K-complex and sleep spindle for N2 (Rule N2.1); and low-amplitude mixed-frequency EEG, rapid eye movements, and low chin EMG tone for stage R (Rule R.1). In each case it provided channel-specific observations, quantitative descriptions of timing and morphology, and explicit exclusionary reasoning against alternative stages. A fourth example exposed a compound failure: for a ground-truth N1 epoch, SleepVLM correctly identified a K-complex but reported a nonexistent sleep spindle, cited Rule N2.1, and predicted N2, whereas expert review determined that the K-complex is arousal-associated and the applicable rule is N2.4 (N2 termination, reversion to N1). This case shows that the model can err in both feature perception and rule application, consistent with the N1-to-N2 confusion in the confusion matrix (Fig. 2b) and the well-documented difficulty of disambiguating these transitional stages. The rationale followed a consistent output structure (channel-level observation, feature identification, rule citation, exclusionary reasoning, and staging conclusion), mirroring the systematic approach used by clinical scorers and enabling direct auditability against the AASM standard. Additional qualitative examples, including correct N3 staging, application of the major body movement rules (MBM.1, MBM.2), and further error cases, are provided in Supplementary Figures 1 and 2.

Fig. 3 | Qualitative examples of rule-grounded model output. Four representative examples of SleepVLM outputs spanning stages W, N2, R, and N1.
Each panel shows three consecutive 30-second PSG epoch images (preceding, current, subsequent) with six channels (F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG), the model’s predicted sleep stage, cited AASM rule identifiers, and the complete rationale. Check marks (✓) denote correct classifications and crosses (✗) denote misclassifications against the ground-truth label. The first three panels illustrate correct staging: Wakefulness identified via alpha rhythm and eye blinks (Rules W.1, W.2), N2 identified via a K-complex and sleep spindle (Rule N2.1), and stage R identified via low-amplitude mixed-frequency (LAMF) EEG activity, rapid eye movements, and low chin EMG tone (Rule R.1). The fourth panel shows a misclassification in which the model predicted N2 for a ground-truth N1 epoch. The model cited Rule N2.1 based on a perceived K-complex and sleep spindle; however, expert review determined that the sleep spindle is nonexistent (perceptual hallucination) and that the K-complex is arousal-associated, making the applicable rule N2.4 (N2 termination, reversion to N1) rather than N2.1. PSG channels are color-coded as described in Methods. The complete definitions of all cited AASM rules are provided in Table 4. Rationales have been condensed for brevity without altering the substantive meaning. Additional qualitative examples are provided in Supplementary Figures 1 and 2.

Ablation analysis

A systematic ablation study evaluated the contribution of each training component to classification performance across seven configurations, including a zero-shot baseline (Table 2).
The zero-shot Qwen2.5-VL-3B-Instruct model, receiving the complete system prompt with AASM rules but no task-specific training, achieved near-chance performance (MASS-SS1 κ = −0.003; ZUAMHCS κ = −0.013), confirming that domain-specific fine-tuning is essential for this task. Removing waveform-perceptual pre-training (w/o WPT) reduced kappa by 3.1 percentage points on MASS-SS1 (0.767 → 0.736) and 1.6 pp on ZUAMHCS (0.743 → 0.727), indicating that pre-training on per-second spectral and amplitude targets strengthens the model’s visual feature extraction. Removing rule grounding from the system prompt and training targets (w/o Rule Grounding) produced a smaller but notable impact on MASS-SS1 (Δκ = −0.9 pp) and a larger decrement on ZUAMHCS (Δκ = −2.3 pp, 0.743 → 0.720), suggesting that explicit rule anchoring acts as a form of knowledge regularization that particularly benefits cross-domain generalization. The importance of training data scale was demonstrated by the w/o Coarse Annotation condition, where restricting supervised fine-tuning to the five fine-annotated subjects caused a substantial decline (Δκ = −10.0 pp on MASS-SS1); further removing WPT (w/o Coarse Annot. & WPT) lowered kappa to 0.526. Adding rejection sampling fine-tuning (+RFT) did not improve over SleepVLM (MASS-SS1 κ = 0.728; ZUAMHCS κ = 0.727), potentially because the perplexity-gain-selected rationales introduced distributional shift during the additional training phase.

Expert ratings of reasoning quality across the four principal configurations (Fig. 4) revealed complementary trends. SleepVLM achieved the highest scores on all three dimensions on MASS-SS1 (4.24/4.14/4.15) and maintained competitive performance on ZUAMHCS (4.09/4.01/4.04).
Removing WPT primarily impaired Factual Accuracy (MASS-SS1: 4.24 → 3.95, Δ = −0.29; ZUAMHCS: 4.09 → 3.96, Δ = −0.13), with smaller effects on Evidence Comprehensiveness and Logical Coherence, consistent with the role of perceptual pre-training in enhancing waveform description fidelity rather than reasoning structure. Removing rule grounding had a pronounced effect in the cross-domain setting: on ZUAMHCS, Factual Accuracy dropped from 4.09 to 3.85 (Δ = −0.24) and Evidence Comprehensiveness from 4.01 to 3.89 (Δ = −0.12), yielding a composite decrease of 0.41 points. The +RFT configuration showed mixed results, lower than SleepVLM on MASS-SS1 (composite 11.80 vs. 12.53) yet slightly higher on ZUAMHCS for Logical Coherence (4.09 vs. 4.04), suggesting that the effect of rejection sampling is sensitive to data distribution characteristics. Consistent with these mean scores, SleepVLM also yielded the largest number of score-5 ratings on both datasets while keeping the proportion of low scores (≤ 2) among the lowest across configurations.
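The confidence intervals reported in the tables of this section are described as coming from 1,000 subject-level cluster bootstrap resamples. The sketch below shows one common way to build such an interval: resample whole subjects with replacement, pool their epochs, recompute the metric, and read off percentiles. The percentile method, the fixed seed, the toy labels, and all function names are assumptions for illustration; the paper’s exact procedure may differ.

```python
import random


def accuracy(pairs):
    """Fraction of (true, predicted) pairs that agree."""
    return sum(1 for true, pred in pairs if true == pred) / len(pairs)


def cluster_bootstrap_ci(per_subject, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval that resamples subjects, not individual epochs."""
    rng = random.Random(seed)
    subjects = sorted(per_subject)
    stats = []
    for _ in range(n_resamples):
        # Draw as many subjects as the cohort contains, with replacement,
        # so that all epochs of a drawn subject enter the resample together.
        drawn = [rng.choice(subjects) for _ in subjects]
        pooled = [pair for s in drawn for pair in per_subject[s]]
        stats.append(metric(pooled))
    stats.sort()
    lower = stats[int(n_resamples * alpha / 2)]
    upper = stats[min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))]
    return lower, upper


# Hypothetical toy cohort (assumption): three subjects, four epochs each.
toy = {
    "s1": [("W", "W"), ("N1", "N2"), ("N2", "N2"), ("R", "R")],
    "s2": [("W", "N1"), ("N1", "N2"), ("N2", "N2"), ("N3", "N3")],
    "s3": [("W", "W"), ("N2", "N2"), ("N3", "N3"), ("R", "R")],
}

point = accuracy([pair for pairs in toy.values() for pair in pairs])
ci_low, ci_high = cluster_bootstrap_ci(toy, accuracy)
print(point, ci_low, ci_high)
```

Resampling at the subject level keeps the within-subject correlation between epochs intact, which an epoch-level bootstrap would ignore and which typically makes the resulting intervals wider.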
Table 2 | Ablation study: contribution of each training component to classification performance

| Configuration | WPT | SFT data | Rule grounding | RFT | MASS-SS1 Accuracy | MASS-SS1 Macro-F1 | MASS-SS1 Kappa | ZUAMHCS Accuracy | ZUAMHCS Macro-F1 | ZUAMHCS Kappa |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot baseline | ✗ | n/a | n/a | ✗ | 0.100 [0.087, 0.114] | 0.070 [0.064, 0.076] | −0.003 [−0.006, −0.001] | 0.201 [0.186, 0.216] | 0.104 [0.098, 0.108] | −0.013 [−0.016, −0.010] |
| SleepVLM | ✓ | Fine + Coarse | ✓ | ✗ | 0.835 [0.824, 0.846] | 0.793 [0.777, 0.806] | 0.767 [0.750, 0.782] | 0.812 [0.796, 0.827] | 0.766 [0.749, 0.782] | 0.743 [0.721, 0.763] |
| w/o WPT | ✗ | Fine + Coarse | ✓ | ✗ | 0.818 [0.804, 0.831] | 0.753 [0.735, 0.770] | 0.736 [0.715, 0.755] | 0.804 [0.791, 0.818] | 0.743 [0.728, 0.758] | 0.727 [0.708, 0.747] |
| w/o Rule Grounding | ✓ | Fine + Coarse | ✗ | ✗ | 0.829 [0.817, 0.840] | 0.785 [0.768, 0.798] | 0.758 [0.740, 0.773] | 0.795 [0.777, 0.813] | 0.746 [0.725, 0.765] | 0.720 [0.695, 0.744] |
| +RFT | ✓ | Fine + Coarse | ✓ | ✓ | 0.810 [0.794, 0.824] | 0.738 [0.718, 0.755] | 0.728 [0.705, 0.749] | 0.801 [0.779, 0.820] | 0.737 [0.712, 0.756] | 0.727 [0.699, 0.753] |
| w/o Coarse Annotation | ✓ | Fine only | ✓ | ✗ | 0.767 [0.749, 0.782] | 0.700 [0.685, 0.711] | 0.667 [0.641, 0.690] | 0.760 [0.740, 0.779] | 0.680 [0.660, 0.699] | 0.668 [0.641, 0.694] |
| w/o Coarse Annot. & WPT | ✗ | Fine only | ✓ | ✗ | 0.673 [0.649, 0.694] | 0.584 [0.562, 0.603] | 0.526 [0.497, 0.553] | 0.699 [0.680, 0.716] | 0.571 [0.553, 0.587] | 0.565 [0.540, 0.587] |

WPT, waveform-perceptual pre-training; SFT, supervised fine-tuning; RFT, rejection sampling fine-tuning. ‘Fine + Coarse’, 5 fine-annotated and 45 coarse-annotated MASS-SS3 subjects; ‘Fine only’, 5 fine-annotated subjects only. SleepVLM is shown in bold. Classification performance values are point estimates with 95% confidence intervals [lower, upper] derived from 1,000 subject-level cluster bootstrap resamples. The best result in each column is shown in bold.

Fig. 4 | Ablation study: expert evaluation of reasoning quality across model configurations.
Expert rating score distributions for four model configurations on MASS-SS1 (left, 530 stratified samples per configuration) and ZUAMHCS (right, 1,000 stratified samples per configuration). Each stacked horizontal bar shows the number of samples receiving each integer score (0–5, color-coded) for three evaluation dimensions: Factual Accuracy, Evidence Comprehensiveness, and Logical Coherence. Mean scores are annotated to the right of each bar. For brevity, count labels are omitted from bar segments representing fewer than 4% of the samples in a given configuration. The four configurations are: SleepVLM (proposed method), w/o WPT (without waveform-perceptual pre-training), w/o Rule Grounding (without AASM rule identifiers in the system prompt and training targets), and +RFT (with additional rejection sampling fine-tuning). The same set of epoch identifiers was evaluated across all configurations. The complete rating scale is defined in Supplementary Table 1.

Model quantization preserves performance for clinical deployment

To assess the feasibility of deploying SleepVLM in resource-constrained clinical environments, we applied W4A16 post-training quantization (4-bit weights, 16-bit activations) using Intel AutoRound [33] (Table 3). Quantization reduced the model size from 7.1 GB to 3.2 GB (−54.9%) and increased inference throughput from 1.89 to 4.15 epochs per second (+2.20×) on a single NVIDIA RTX 4090 GPU, enabling deployment on consumer-grade hardware. Classification performance degradation was minimal: accuracy decreased by 0.8 pp on MASS-SS1 (0.835 → 0.827) and 1.4 pp on ZUAMHCS (0.812 → 0.798); kappa decreased by 0.9 pp on MASS-SS1 (0.767 → 0.758) and 1.6 pp on ZUAMHCS (0.743 → 0.727).
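The quantization figures above mix three kinds of change: a relative reduction (model size), a fold-increase (throughput), and absolute differences in percentage points (accuracy and kappa). The short sketch below recomputes each from the values quoted in the text, as a check on how the Δ quantities are defined. The helper names are ours and not part of any library.

```python
def percent_change(before, after):
    """Relative change in percent (negative means a reduction)."""
    return (after - before) / before * 100.0


def fold_change(before, after):
    """Ratio of the new value to the old one."""
    return after / before


def pp_change(before, after):
    """Absolute change between two proportions, in percentage points."""
    return (after - before) * 100.0


# Values quoted in the text for the BF16 and W4A16 models.
size_delta = percent_change(7.1, 3.2)   # model size in GB
speedup = fold_change(1.89, 4.15)       # epochs per second
kappa_delta = {
    "MASS-SS1": pp_change(0.767, 0.758),
    "ZUAMHCS": pp_change(0.743, 0.727),
}

print(round(size_delta, 1), round(speedup, 2))
print({name: round(delta, 1) for name, delta in kappa_delta.items()})
```

Keeping relative and absolute changes separate matters here: a 1.6 pp kappa loss on ZUAMHCS corresponds to a relative loss of roughly 2%, and the two should not be conflated.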
Notably, the quantized model's kappa on ZUAMHCS (0.727) still surpassed the majority of full-precision signal-based baselines, including SleepDG (0.719), TinySleepNet (0.712), and U-Sleep (0.697). Quantization preserved the model's ability to generate structured rationales with AASM rule citations, maintaining the interpretability characteristics essential for clinical adoption.

Table 3 | Effect of W4A16 quantization on resource efficiency and classification performance

Group | Metric | Full-precision (BF16) | Quantized (W4A16) | Δ
Resource | Model size | 7.1 GB | 3.2 GB | −54.9%
Resource | Inference speed | 1.89 epoch/s | 4.15 epoch/s | +2.20×
MASS-SS1 | Accuracy | 0.835 [0.824, 0.846] | 0.827 [0.815, 0.838] | −0.008
MASS-SS1 | Macro-F1 | 0.793 [0.777, 0.806] | 0.788 [0.772, 0.801] | −0.005
MASS-SS1 | Kappa | 0.767 [0.750, 0.782] | 0.758 [0.740, 0.774] | −0.009
ZUAMHCS | Accuracy | 0.812 [0.796, 0.827] | 0.798 [0.782, 0.813] | −0.014
ZUAMHCS | Macro-F1 | 0.766 [0.749, 0.782] | 0.751 [0.733, 0.769] | −0.015
ZUAMHCS | Kappa | 0.743 [0.721, 0.763] | 0.727 [0.705, 0.748] | −0.016

W4A16, 4-bit weight with 16-bit activation post-training quantization [33]. Inference benchmarked on a single NVIDIA RTX 4090 (24 GB) GPU. For model size and inference speed, Δ indicates percentage reduction and fold-increase in throughput, respectively. For classification metrics, Δ indicates the absolute change between the quantized and full-precision model. Classification performance values are point estimates with 95% confidence intervals [lower, upper] derived from 1,000 subject-level cluster bootstrap resamples. The best result in each column is shown in bold.

Discussion

This study presents, to our knowledge, the first application of a vision-language model to explainable sleep staging.
In a single forward pass, SleepVLM generates both a sleep stage label and a structured rationale that explicitly cites AASM rules, directly targeting the interpretability gap that has been identified as a major barrier to clinical adoption of automated sleep staging [6, 7]. On the held-out test set MASS-SS1 and the external clinical test set ZUAMHCS, SleepVLM achieved Cohen's kappa of 0.767 and 0.743, respectively, placing it in the same performance tier as the strongest signal-based baseline LPSGM (0.763/0.750) and the image-based SleepXViT (0.771/0.694), with overlapping confidence intervals. The cross-domain kappa decline was 2.4 percentage points for SleepVLM compared with 7.7 for SleepXViT, indicating more stable cross-cohort behavior, although this comparison alone does not identify the underlying mechanism. Beyond classification accuracy, expert evaluation confirmed that the model's reasoning met a "Good" standard across all three evaluation dimensions on both datasets (MASS-SS1: 4.24/4.14/4.15; ZUAMHCS: 4.09/4.01/4.04), with a cross-domain composite decline of only 0.39 points. Among all compared methods, SleepVLM was unique in providing an auditable, rule-cited rationale for every staged epoch while maintaining classification performance comparable to current leading approaches, suggesting that clinician-readable reasoning need not come at a prohibitive performance cost. W4A16 post-training quantization [33] further reduced model size from 7.1 GB to 3.2 GB and increased throughput from 1.89 to 4.15 epochs per second with kappa losses not exceeding 1.6 percentage points, supporting deployment on consumer-grade hardware.

Ablation analyses shed light on why the framework is effective.
Removing waveform-perceptual pre-training (WPT) reduced kappa by 3.1 percentage points on MASS-SS1 and 1.6 on ZUAMHCS, with the largest impact on Factual Accuracy (MASS-SS1: 4.24 → 3.95; ZUAMHCS: 4.09 → 3.96) and smaller effects on Evidence Comprehensiveness and Logical Coherence. This pattern is consistent with WPT primarily strengthening perceptual fidelity: it helps the model describe PSG morphology more faithfully rather than reorganizing reasoning structure. By contrast, removing rule grounding produced a modest in-domain effect but a larger cross-domain penalty (Δκ = −0.9 pp on MASS-SS1 versus −2.3 pp on ZUAMHCS; reasoning composite Δ = −0.41 on ZUAMHCS), suggesting that explicit rule grounding may supply a domain-stable prior that constrains inference when recording conditions differ. Data scale and annotation granularity also mattered. Adding 45 coarse-annotated subjects, which provided only stage labels and rule identifiers without full rationales, restored 10.0 percentage points of kappa on MASS-SS1 relative to training on the five fine-annotated subjects alone. This result has practical significance because it indicates that expanding morphological coverage through lower-cost coarse annotation is a viable strategy for scaling the training pipeline without requiring exhaustive expert-written rationales. Rejection sampling fine-tuning did not improve performance and in some settings degraded it, potentially because perplexity-gain-based selection favored stylistically smooth but distributionally atypical rationales.
Overall, the ablation evidence supports a complementary view of the pipeline: WPT helps the model perceive PSG waveforms more faithfully, while rule grounding helps it reason more consistently in clinically meaningful terms.

More broadly, our findings highlight a distinction between two levels of interpretability in sleep staging. The question is not only where the model attended, but whether it can explain why a given epoch should be scored as W, N1, N2, N3, or R under the AASM framework. Existing XAI approaches address part of this problem: attention visualization highlights which temporal segments receive high model focus [6], prototype-based learning relates predictions to learned waveform archetypes [18], and attribution techniques such as Grad-CAM and layer-wise relevance propagation mark influential input regions [19,21]. Yet these outputs remain model-centric artifacts that require expert reinterpretation before they can serve as clinical justification. SleepVLM shifts the explanation into clinician-facing language. Its rationale describes channel-specific observations, names electrophysiological features using standard sleep terminology, cites explicit rule identifiers, and uses exclusionary logic to argue against alternative stages. This makes each staging decision auditable against the operationalized criteria in Table 4 and aligns with the notion of clinical plausibility advocated by Bienefeld et al. [7], who showed that clinicians need explanations framed in their own diagnostic vocabulary and rule systems rather than abstract model-centric metrics. At the same time, the ability to produce fluent clinical language does not guarantee that the content is correct. As reported for

Several limitations should be acknowledged.
First, all reasoning-quality scores were assigned by a single trained sleep technologist. Although the rubric used anchored descriptions for each score level (Supplementary Table 1), a single-rater design cannot quantify inter-rater reliability or fully exclude scorer-specific preferences. Future studies should adopt multi-rater designs and report agreement statistics such as intraclass correlation coefficients or weighted kappa. Second, the training and evaluation data span a limited number of centers and recording environments. Although the two test sets extend coverage beyond the training archive and, in the case of ZUAMHCS, to an independent clinical site, broader multi-center validation across diverse acquisition settings and clinical populations remains necessary. In addition, the 15 AASM rules operationalized in this study (Table 4) are restricted to the adult scoring criteria; the AASM Manual defines separate rules for children and infants [3]. As an initial demonstration, the present work focused on the adult rule set; extending rule-grounded annotation and training to incorporate pediatric and infant scoring logic alongside broader multi-center data is a clear priority. Third, the model can produce outputs that diverge from clinical ground truth through perceptual hallucination or contextual rule misapplication, consistent with known VLM failure modes [30]. In the qualitative error analysis, such errors were concentrated at ambiguous stage boundaries, as exemplified by the N1-to-N2 misclassification in which the model hallucinated a sleep spindle and misidentified an arousal-associated K-complex as meeting the non-arousal criterion of Rule N2.1.
The elevated proportion of low Factual Accuracy scores on ZUAMHCS (score of 1 in 11.0% of evaluated samples) further indicates that cross-domain signal variability can amplify such perceptual errors. Fourth, the current input design imposes two information constraints: rendering PSG signals at a fixed resolution inevitably discards fine temporal and amplitude detail present in the original recordings, and restricting the input to three consecutive epochs limits the temporal context available for resolving transitions that may benefit from a longer scoring window. Finally, N1 remained the most difficult stage (recall 0.45 on MASS-SS1, 0.54 on ZUAMHCS), consistent with its intrinsically ambiguous electrophysiology [5]. The present work did not employ class-balancing techniques such as weighted loss functions, which may have contributed to the lower recall of underrepresented stages and merits investigation in future iterations.

These limitations define a clear agenda for future work. The most immediate priority is broader validation across multiple centers, recording systems, and clinical populations, together with training data that encompasses a wider range of recording environments. Incorporating the AASM scoring rules for children and infants, along with matched pediatric corpora, would extend the framework beyond adult sleep staging. A second priority is to standardize reasoning evaluation itself by establishing multi-rater benchmarks for model-generated explanations rather than treating label accuracy as the sole endpoint.
On the modeling side, larger VLM architectures, hybrid signal-image inputs, and longer temporal context windows may help address current information constraints and improve classification at ambiguous stage boundaries. Human-in-the-loop refinement, in which clinicians review and correct model-generated reasoning with corrections fed back into iterative fine-tuning, offers a natural path to progressively improve both classification and explanation quality. More broadly, the rule-grounded VLM strategy demonstrated here may extend to other rule-governed PSG interpretation tasks such as respiratory event scoring and movement event classification, and potentially to other medical signal interpretation tasks governed by explicit clinical guidelines. We hope that the public release of the dataset, code, and model weights will facilitate community efforts toward more rigorous benchmarks for reasoning quality in interpretable sleep staging.

In summary, SleepVLM establishes a framework for moving automated sleep staging from label-only prediction toward auditable, rule-grounded clinical reasoning, providing a foundation for trustworthy human–AI collaboration in sleep medicine.

Methods

Datasets

We used the Montreal Archive of Sleep Studies (MASS), an open-access collection of whole-night PSG recordings [42], as the primary data source for model development and held-out testing. All MASS subsets used in this study had a sampling frequency of 256 Hz for EEG, EOG, and EMG channels. MASS-SS3 (62 subjects) served as the development cohort for supervised training and validation.
Within MASS-SS3, 5 subjects were assigned to the fine-annotated training subset, 45 subjects to the coarse-annotated training subset, and the remaining 12 subjects to the validation subset. MASS-SS2 (19 subjects), MASS-SS4 (40 subjects), and MASS-SS5 (26 subjects) were used exclusively for Phase 1 waveform-perceptual pre-training. These three subsets were originally scored using Rechtschaffen and Kales (R&K) criteria and used 20-s epochs, but Phase 1 did not use sleep stage labels; therefore, these differences did not affect the supervision target. MASS-SS1 (53 subjects) was reserved as a held-out test set throughout model development.

We further evaluated the model on the Zhejiang University Affiliated Mental Health Center Sleep dataset (ZUAMHCS), an external clinical test set comprising 100 adult subjects randomly selected from patients who underwent clinical PSG between 2020 and 2025. Recordings were acquired using the Australian Compumedics Grael polysomnography system at a sampling frequency of 512 Hz and scored according to AASM criteria. ZUAMHCS was used exclusively for external evaluation and was not involved in any training, validation, checkpoint selection, or model design decisions. Data collection was conducted at Zhejiang University with Institutional Review Board approval, and written consent was obtained from all subjects or their caregivers. Dataset characteristics and sleep stage distributions for all cohorts are summarized in Supplementary Tables 2 and 3, respectively.

To support rule-grounded supervision, we constructed MASS-EX, an expert-annotated dataset based on all 62 MASS-SS3 subjects and their 59,317 original epochs.
The underlying rule library comprises 15 adult sleep staging rules derived from the AASM Manual for the Scoring of Sleep and Associated Events, Version 3 [3], and operationalized for the six-channel montage used in this study (Table 4). The rule set was developed jointly by a trained sleep technologist and a senior sleep medicine physician with over a decade of clinical experience. Because the model input for supervised staging used a preceding–current–subsequent three-epoch window, the first and last epoch of each recording could not serve as the center epoch and were therefore excluded from annotation. This yielded 59,193 annotated central epochs. Among them, 5,006 epochs from 5 subjects received fine-grained annotations consisting of applicable rule identifiers, a free-text rationale, and a sleep stage label. The remaining 54,187 epochs from 57 subjects received coarse annotations consisting of applicable rule identifiers and a sleep stage label only. Annotations were produced through an expert-driven, machine-assisted pipeline: the two experts first authored high-quality exemplar annotations for each sleep stage; a locally deployed Qwen2.5-VL-72B-Instruct model then generated draft annotations for all target epochs using these exemplars as few-shot demonstrations; the sleep technologist manually reviewed and corrected every generated annotation; and the senior physician independently verified and finalized the results.

Table 4 | AASM sleep staging rules operationalized for visual reasoning

Stage Category | Rule ID | Rule Type | Operationalized Visual Criteria | Assigned Stage
W | W.1 | Onset | >50% of the epoch contains posterior dominant rhythm (alpha rhythm, 8–13 Hz), primarily measured in the occipital derivation (O2-M1). | W
W | W.2 | Onset | >50% of the epoch contains eye blinks (conjugate vertical movements at 0.5–2 Hz) in EOG channels. | W
W | W.3 | Onset | >50% of the epoch contains irregular, conjugate rapid eye movements (REMs) in EOG channels associated with normal or high chin muscle tone. | W
N1 | N1.1 | Onset | For alpha generators: Posterior dominant rhythm is attenuated and replaced by LAMF activity for >50% of the epoch. | N1
N1 | N1.2 | Onset | For non-alpha generators: Earliest appearance of EEG slowing by ≥1 Hz from stage W (into 4–7 Hz range), vertex sharp waves, or slow eye movements (SEMs). | N1
N2 | N2.1 | Onset | In the absence of criteria for stage N3, presence of ≥1 K complexes unassociated with arousals or ≥1 sleep spindles (11–16 Hz) occurring in the first half of the epoch or the last half of the preceding epoch. | N2
N2 | N2.2 | Continuation | Epoch exhibits LAMF activity without K complexes or sleep spindles, but is preceded by an epoch containing non-arousal associated K complexes or sleep spindles without an intervening arousal. | N2
N2 | N2.3 | Continuation | Epoch following an N3 stage that no longer meets the criteria for stage N3, lacks an intervening arousal, and does not meet criteria for stage W or stage R. | N2
N2 | N2.4 | Termination | End N2 when transitioning to stage W, stage N3, or stage R; upon an arousal followed by LAMF (revert to N1 until a K complex unassociated with an arousal or a sleep spindle reappears); or upon a major body movement followed by SEMs (revert to N1). | W, N1, N3, or R
N3 | N3.1 | Onset | ≥20% of the epoch consists of slow wave activity (0.5–2.0 Hz, peak-to-peak amplitude >75 µV), predominantly in the frontal derivations (F4-M1). | N3
R | R.1 | Onset | Definite R: Simultaneous presence of (a) LAMF EEG without K complexes or sleep spindles, (b) low chin EMG tone for the majority of the epoch and concurrent with REMs, and (c) REMs at any position within the epoch. | R
R | R.2 | Continuation | Epoch contiguous with a definite R epoch showing persistent LAMF EEG without K complexes or sleep spindles and low chin EMG tone, with no intervening arousals and in the absence of REMs or SEMs following an arousal. | R
R | R.3 | Termination | End R when transitioning to W or N3; chin EMG tone increases above the level of stage R for the majority of the epoch with N1-like EEG; an arousal occurs followed by SEMs (score as N1 even if chin EMG remains low); or K complexes/sleep spindles appear in the first half of the epoch without REMs (score as N2). | W, N1, N2, or N3
Major Body Movement | MBM.1 | Scoring | Epoch is obscured by movement and muscle artifact for more than half an epoch. Scored as W if posterior dominant rhythm (alpha rhythm) is present for part of the epoch, or if an epoch scoreable as stage W either precedes or follows the epoch. | W
Major Body Movement | MBM.2 | Scoring | Epoch obscured by artifact not meeting MBM.1 criteria. Scored as the same stage as the epoch that follows it. | Same as subsequent

Fifteen rules for adult sleep staging adapted from the AASM Manual for the Scoring of Sleep and Associated Events (Version 3) [3], covering the rules applicable to the three EEG (F4-M1, C4-M1, O2-M1), two EOG (LOC, ROC), and chin EMG channels used in this study. For presentation, the rules shown here are compressed versions rather than the full verbatim text. AASM, American Academy of Sleep Medicine; MBM, major body movement; EEG, electroencephalography; EOG, electrooculography; EMG, electromyography; LAMF, low-amplitude mixed-frequency; SEMs, slow eye movements; REMs, rapid eye movements.

PSG signal processing and waveform rendering

We used six PSG channels for all model training and evaluation: three EEG derivations (F4–M1, C4–M1, and O2–M1), two EOG channels (LOC and ROC), and chin EMG, following the AASM-recommended montage for adult sleep staging.
EEG and EOG channels were band-pass filtered at 0.3–35 Hz with a fourth-order zero-phase Butterworth filter. Chin EMG was band-pass filtered at 10–100 Hz with a fourth-order zero-phase Butterworth filter. A notch filter (Q = 20) was applied to all channels. All signals were then resampled to 100 Hz and segmented into non-overlapping 30-s epochs.

The filtered signals were rendered as standardized multichannel waveform images so that the vision-language model could operate in a mode analogous to human technologists visually inspecting PSG traces. Each image had a resolution of 448 × 224 pixels (width × height) with a black background. The six channels were stacked vertically using distinct colors and fixed amplitude scales: ±50 µV for EEG and EOG, and ±40 µV for chin EMG. Time grid lines were drawn at 1-s and 5-s intervals. Signal excursions beyond the nominal channel boundaries were not clipped, so that extreme amplitudes remained visually available to the model.

Training framework

SleepVLM was developed in three sequential phases: waveform-perceptual pre-training (WPT), rule-grounded supervised fine-tuning (SFT), and rejection sampling fine-tuning (RFT). The backbone model was Qwen2.5-VL-3B-Instruct, a 3-billion-parameter vision-language model. All phases used low-rank adaptation (LoRA) [43] for parameter-efficient fine-tuning, with rank r = 16, scaling factor α = 32, and dropout = 0.05. Training was performed on a single node with 8 NVIDIA A100 GPUs (80 GB each) using bfloat16 mixed precision. Complete training, quantization, and inference hyperparameters are summarized in Supplementary Table 7.
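The filtering, resampling, and epoching steps from the PSG signal processing subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the released SleepVLM code: the function name `preprocess_channel` and the use of SciPy are choices made here, the zero-phase behavior is obtained by forward-backward filtering of a fourth-order design, and the notch frequency is passed in explicitly because the text specifies only the quality factor (Q = 20).

```python
from fractions import Fraction

import numpy as np
from scipy import signal

EPOCH_SEC = 30    # epoch length stated in the text
TARGET_FS = 100   # resampling rate stated in the text


def preprocess_channel(x, fs, band, notch_hz):
    """Band-pass, notch, resample to 100 Hz, and cut into 30-s epochs.

    x        : 1-D array of raw samples (µV)
    fs       : original sampling rate in Hz (256 for MASS, 512 for ZUAMHCS)
    band     : (low, high) pass band in Hz, e.g. (0.3, 35) for EEG/EOG
    notch_hz : notch centre frequency in Hz (ASSUMPTION: not given in the text)
    """
    # Fourth-order Butterworth design, run forward and backward for zero phase.
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos, x)

    # Notch filter with Q = 20, also applied with zero phase.
    b, a = signal.iirnotch(notch_hz, Q=20, fs=fs)
    x = signal.filtfilt(b, a, x)

    # Rational resampling to 100 Hz (e.g. 256 Hz -> up 25, down 64).
    ratio = Fraction(TARGET_FS, int(fs))
    x = signal.resample_poly(x, ratio.numerator, ratio.denominator)

    # Non-overlapping 30-s epochs; an incomplete trailing epoch is dropped.
    n = EPOCH_SEC * TARGET_FS
    n_epochs = len(x) // n
    return x[: n_epochs * n].reshape(n_epochs, n)


# Usage on synthetic noise standing in for one EEG channel (95 s at 256 Hz).
rng = np.random.default_rng(0)
raw = rng.standard_normal(256 * 95)
epochs = preprocess_channel(raw, fs=256, band=(0.3, 35), notch_hz=60.0)
```

Each row of `epochs` holds 3,000 samples, one 30-s epoch at 100 Hz, which is the unit that is subsequently rendered as a waveform image.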
Phase 1: Waveform-perceptual pre-training

Phase 1 was designed as a perceptual pretext task to teach the model to quantitatively interpret PSG waveform images before sleep staging supervision was introduced. The input was a single 30-s waveform image. The model was trained to predict per-second spectral and amplitude descriptors rather than sleep stages. For EEG and EOG channels, the supervision target for each 1-s window comprised delta (0.3–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), and beta (13–30 Hz) band power in dB, together with mean absolute value (MAV) in µV. For chin EMG, the target comprised MAV only. Frequency-domain targets were derived from Welch power spectral estimation applied to each 1-s window, using the full window length as nperseg and 50% overlap. Band power was computed by integrating the power spectral density within each predefined frequency band and then converting the result to the decibel scale. Training data comprised all epochs from MASS-SS2, SS4, and SS5. The system prompt is provided in Supplementary Note 1. LoRA was applied to all nn.Linear layers excluding lm_head, and the vision encoder was unfrozen, allowing the model to adapt to the PSG waveform image domain. Optimization used AdamW (β₁ = 0.9, β₂ = 0.95, ϵ = 10⁻⁸) with a learning rate of 1 × 10⁻⁴, 3% linear warmup followed by linear decay, weight decay of 0.1, gradient clipping at a maximum norm of 1.0, gradient accumulation steps of 1, and a per-device batch size of 4 (effective batch size 32) for 2 epochs.

Phase 2: Rule-grounded supervised fine-tuning

Phase 2 adapted the model to perform sleep staging with structured, rule-cited reasoning.
The input consisted of three consecutive epoch images representing the preceding, current, and subsequent epochs. The staging decision was always made for the central epoch. The system prompt comprised five components: role and task definition, image rendering parameters, the complete AASM rules expressed as text, task instructions, and the output format. The operationalized rule set is presented in Table 4, and the complete prompt is provided in Supplementary Note 2. Two types of supervision were mixed during training, sharing the same system prompt but differing in task instructions and output format. For the 5 fine-annotated subjects, the model was trained to generate a sleep stage, applicable rule identifiers, and a free-text rationale. For the 45 coarse-annotated subjects, the model was trained to generate a sleep stage and applicable rule identifiers only. At inference time, the model always used the fine-grained output format and generated a complete rationale regardless of how the training sample was supervised. During Phase 2, the vision encoder was frozen to preserve the visual representations learned in Phase 1, and LoRA was applied to nn.Linear layers in the language model, excluding lm_head. Optimization settings matched Phase 1. Training ran for 15 epochs with a per-device batch size of 6 without gradient accumulation (effective batch size 48). Checkpoints were saved every 1,000 optimizer steps, and the best checkpoint was selected on the 12-subject MASS-SS3 validation set.

Phase 3: Rejection sampling fine-tuning

Phase 3 was designed to expand reasoning supervision for the coarse-annotated subjects without requiring additional expert annotation.
Starting from the best Phase 2 checkpoint, we performed 60 independent stochastic inference runs for every epoch from the 45 coarse-annotated subjects, using temperature = 1.0 and top-p = 1.0. A candidate response was retained only if three conditions were satisfied: (1) the response was successfully parsed into the predicted sleep stage, explanatory reasoning, and applicable rule identifiers; (2) the predicted sleep stage exactly matched the ground-truth label and the predicted rule set exactly matched the ground-truth rule set; and (3) the rationale was written in English only. Among the retained correct candidates for each epoch, the best rationale was selected using a perplexity-gain criterion. For a candidate reasoning sequence $r = (r_1, \dots, r_T)$ conditioned on context $c$, perplexity is defined as

$$\mathrm{PPL}(r \mid c) = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p\left(r_t \mid r_{<t}, c\right)\right) \tag{1}$$

Perplexity-gain is then defined as

$$\text{PPL-gain} = \mathrm{PPL}\left(r \mid c_{\text{full}}\right) - \mathrm{PPL}\left(r \mid c_{\text{text-only}}\right) \tag{2}$$

where $c_{\text{full}}$ denotes the full conditioning context including the system prompt, PSG images, and response, and $c_{\text{text-only}}$ denotes the same context with the PSG images removed. A lower PPL-gain indicates that the rationale is more strongly conditioned on the visual input rather than predictable from language priors alone. For each epoch, the correct response with the minimum PPL-gain was selected as the training target. The resulting perplexity-gain-selected reasoning samples were mixed with the original fine-grained annotations from the 5 fine-annotated subjects to form the Phase 3 training set. Training used the same optimization settings as Phase 2 and ran for 5 epochs.
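The selection step at the end of Phase 3 can be illustrated with a small sketch. It assumes that each retained candidate has already been scored twice by the model under teacher forcing, once with the PSG images in context and once without, and that the per-token natural-log probabilities of the rationale are available. The dictionary keys (`logp_full`, `logp_text_only`) and function names are hypothetical, and the gain is taken here as the difference of the two perplexities.

```python
import math


def perplexity(token_logprobs):
    """PPL = exp(-(1/T) * sum_t log p(r_t | r_<t, c)) for one rationale."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_gain(candidate):
    """PPL with images in context minus PPL with images removed."""
    return perplexity(candidate["logp_full"]) - perplexity(candidate["logp_text_only"])


def select_by_ppl_gain(candidates):
    """Return the retained candidate with the minimum PPL-gain."""
    return min(candidates, key=ppl_gain)


# Toy example with made-up log-probabilities (not real model outputs).
candidates = [
    # Rationale "a" is almost as predictable without the images.
    {"id": "a", "logp_full": [-0.2, -0.3, -0.1], "logp_text_only": [-0.3, -0.4, -0.2]},
    # Rationale "b" becomes much less predictable once the images are removed.
    {"id": "b", "logp_full": [-0.1, -0.1, -0.1], "logp_text_only": [-1.0, -1.2, -0.8]},
]
best = select_by_ppl_gain(candidates)
print(best["id"])  # b
```

In the toy example, candidate "b" is selected because removing the images raises its perplexity far more than it does for "a", which is the behavior the criterion is meant to reward.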
Model quantization

To assess deployment feasibility, we applied W4A16 post-training quantization (4-bit weights, 16-bit activations) using Intel AutoRound [33]. Quantization was applied to all nn.Linear layers in the 36 transformer blocks of the language model, while the vision encoder and lm_head were retained in float16 precision. Calibration used 5,000 samples drawn from the training set with stratified sampling by sleep stage. Quantized inference was evaluated on a single NVIDIA RTX 4090 GPU (24 GB). Additional quantization settings are summarized in Supplementary Table 7.

Evaluation protocol

For all evaluation runs, model generation used near-deterministic decoding with temperature = 10⁻⁶, top-p = 0.8, and a maximum of 1024 new tokens through vLLM. Full-precision inference used bfloat16, whereas quantized inference used float16. Complete inference settings are summarized in Supplementary Table 7.

SleepVLM was compared against 14 baseline methods: 12 signal-based and 2 image-based. The signal-based methods were AttnSleep [10], DeepSleepNet [8], LPSGM [34], ResnetMHA [35], RobustSleepNet [36], SalientSleepNet [37], SeqSleepNet [38], SleepDG [39], TinySleepNet [9], U-Sleep [11], U-Time [12], and XSleepNet [40]. The image-based methods were SleepXViT [21] and ResNet-18 [41], both using the same rendered waveform images as SleepVLM. All baselines were reimplemented from their publicly released code and trained on the same data splits and channel configuration used in this work, with classification performance evaluated under the same metrics and bootstrap procedure.

Classification performance was evaluated using accuracy, macro-F1, and Cohen's kappa. For all metrics, 95% confidence intervals were computed from 1,000 subject-level cluster bootstrap resamples.
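A subject-level cluster bootstrap resamples whole subjects rather than individual epochs, so that the correlation between epochs of the same night is carried into the interval. The sketch below shows one way to do this with the standard library only. The percentile interval, the `per_subject` data layout, and the function names are assumptions made for illustration, since the text does not state which interval construction was used.

```python
import random


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two label sequences of equal length."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(k) / n) * (y_pred.count(k) / n) for k in labels)
    # Degenerate resample with a single label everywhere: treat as perfect agreement.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def cluster_bootstrap_ci(per_subject, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI from resampling whole subjects with replacement.

    per_subject : dict mapping subject id -> (y_true, y_pred) lists
    metric      : function (y_true, y_pred) -> float
    """
    rng = random.Random(seed)
    subjects = list(per_subject)
    stats = []
    for _ in range(n_boot):
        y_true, y_pred = [], []
        # Draw as many subjects as the cohort contains; every draw keeps all
        # epochs of that subject together.
        for s in rng.choices(subjects, k=len(subjects)):
            t, p = per_subject[s]
            y_true.extend(t)
            y_pred.extend(p)
        stats.append(metric(y_true, y_pred))
    stats.sort()
    lo = stats[round((alpha / 2) * n_boot)]
    hi = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy cohort of three subjects with a handful of epochs each (made-up labels).
toy = {
    "s1": (["W", "N1", "N2", "N2", "R"], ["W", "N2", "N2", "N2", "R"]),
    "s2": (["N2", "N3", "N3", "R", "W"], ["N2", "N3", "N2", "R", "W"]),
    "s3": (["W", "W", "N1", "N2", "N3"], ["W", "W", "N1", "N2", "N3"]),
}
lo, hi = cluster_bootstrap_ci(toy, cohen_kappa, n_boot=200)
```

The same function can be reused for accuracy or macro-F1 by passing a different `metric`.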
Per-class F1 scores are provided in Supplementary Tables 4 and 5.

Reasoning quality was evaluated by a trained sleep technologist using a structured rubric with three independently scored dimensions on an integer scale of 0 to 5: Factual Accuracy & Perceptual Fidelity, Diagnostic Evidence Comprehensiveness, and Logical Coherence & Guideline Concordance. The complete rating scale is defined in Supplementary Table 1. For each subject, 10 epochs were selected by stratified random sampling across sleep stages using the largest-remainder method, yielding 530 evaluation samples for MASS-SS1 and 1,000 for ZUAMHCS. The same epoch identifiers were used across the compared model configurations. The rater reviewed each sample by examining the rendered PSG images alongside the complete model output (predicted sleep stage, applicable rule identifiers, and rationale) and scored each of the three dimensions.

Data Availability

The MASS dataset is publicly available at https://borealisdata.ca/dataverse/MASS. The ZUAMHCS dataset is not publicly available due to patient privacy regulations but is available from the corresponding authors upon reasonable request and with institutional ethics approval. The MASS-EX dataset released in this study is available at Zenodo (10.5281/zenodo.19087197), GitHub (https://github.com/Deng-GuiFeng/MASS-EX), and Hugging Face (https://huggingface.co/datasets/Feng613/MASS-EX) under the CC BY-NC 4.0 license. MASS-EX contains annotations only; the underlying PSG signals must be obtained separately from MASS subject to its data use agreement.

Code Availability

The source code for SleepVLM is publicly available at https://github.com/Deng-GuiFeng/SleepVLM.
Pre-trained model weights (full-precision and quantized) are available on Hugging Face (https://huggingface.co/collections/Feng613/sleepvlm). The base model Qwen2.5-VL-3B-Instruct is available at https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct.

References
1. Chattu, V. K. et al. The global problem of insufficient sleep and its serious public health implications. Healthcare 7, 1 (2018).
2. Benjafield, A. V. et al. Estimation of the global prevalence and burden of obstructive sleep apnoea: a literature-based analysis. Lancet Respir. Med. 7, 687–698 (2019).
3. Berry, R. B. et al. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. (American Academy of Sleep Medicine, 2023).
4. Lee, Y. J., Lee, J. Y., Cho, J. H. & Choi, J. H. Interrater reliability of sleep stage scoring: a meta-analysis. J. Clin. Sleep Med. 18, 193–202 (2022).
5. Rosenberg, R. S. & Van Hout, S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J. Clin. Sleep Med. 9, 81–87 (2013).
6. Phan, H. et al. SleepTransformer: automatic sleep staging with interpretability and uncertainty quantification. IEEE Trans. Biomed. Eng. 69, 2456–2467 (2022).
7. Bienefeld, N. et al. Solving the explainable AI conundrum by bridging clinicians' needs and developers' goals. npj Digit. Med. 6, 94 (2023).
8. Supratak, A., Dong, H., Wu, C. & Guo, Y. DeepSleepNet: a model for automatic sleep stage scoring based on raw single-channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 25, 1998–2008 (2017).
9. Supratak, A. & Guo, Y. TinySleepNet: an efficient deep learning model for sleep stage scoring based on raw single-channel EEG.
in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) 641–644 (2020).
10. Eldele, E. et al. An attention-based deep learning approach for sleep stage classification with single-channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 29, 809–818 (2021).
11. Perslev, M. et al. U-Sleep: resilient high-frequency sleep staging. npj Digit. Med. 4, 72 (2021).
12. Perslev, M., Jensen, M. H., Darkner, S., Jennum, P. J. & Igel, C. U-Time: a fully convolutional network for time series segmentation applied to sleep staging. in Advances in Neural Information Processing Systems 32 (2019).
13. Phang, C. R. & Hirata, A. Explainable multiscale temporal convolutional neural network model for sleep stage detection based on electroencephalogram activities. J. Neural Eng. 22, 026010 (2025).
14. Van Der Donckt, J. et al. Do not sleep on traditional machine learning: simple and interpretable techniques are competitive to deep learning for sleep scoring. Biomed. Signal Process. Control 81, 104429 (2023).
15. Horie, K. et al. Automated sleep stage scoring employing a reasoning mechanism and evaluation of its explainability. Sci. Rep. 12, 12799 (2022).
16. Park, K., Hong, J., Lee, W., Shin, H. & Kim, H. DistillSleep: real-time, on-device, interpretable sleep staging from single-channel electroencephalogram. Sleep 48, zsae234 (2025).
17. Muto, V. & Berthomier, C. Looking for a balance between visual and automatic sleep scoring. npj Digit. Med. 6, 165 (2023).
18. Pei, Y., Xu, J., Yu, F., Zhang, L. & Luo, W. WaveSleepNet: an interpretable network for expert-like sleep staging. IEEE J. Biomed. Health Inform. 29, 1371–1382 (2025).
19. Dutt, M., Redhu, S., Goodwin, M. & Omlin, C. W.
SleepXAI: an explainable deep learning approach for multi-class sleep stage identification. Appl. Intell. 53, 16830–16843 (2023).
20. Vaquerizo-Villar, F. et al. An explainable deep-learning model to stage sleep states in children and propose novel EEG-related patterns in sleep apnea. Comput. Biol. Med. 165, 107419 (2023).
21. Lee, H. et al. Explainable vision transformer for automatic visual sleep staging on multimodal PSG signals. npj Digit. Med. 8, 55 (2025).
22. Niknazar, H. & Mednick, S. C. A multi-level interpretable sleep stage scoring system by infusing experts' knowledge into a deep network architecture. IEEE Trans. Pattern Anal. Mach. Intell. 46, 5005–5020 (2024).
23. Al-Hussaini, I. & Mitchell, C. S. SERF: interpretable sleep staging using embeddings, rules, and features. in Proc. 31st ACM International Conference on Information and Knowledge Management (CIKM) 3798–3802 (2022).
24. Holland, R. et al. Specialized curricula for training vision language models in retinal image analysis. npj Digit. Med. 8, 532 (2025).
25. Chen, X. et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. npj Digit. Med. 7, 234 (2024).
26. Shao, A. et al. Generative artificial intelligence for fundus fluorescein angiography interpretation and human expert evaluation. npj Digit. Med. 8, 12 (2025).
27. Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473 (2024).
28. Lu, M. Y. et al. A visual-language foundation model for computational pathology. Nat. Med. 30, 863–874 (2024).
29. Christensen, M., Vukadinovic, M., Yuan, N. & Ouyang, D. Vision–language foundation model for echocardiogram interpretation. Nat. Med. 30, 1481–1488 (2024).
30. Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digit. Med. 7, 190 (2024).
31. Kim, H. et al. Small language models learn enhanced reasoning skills from medical textbooks. npj Digit. Med. 8, 35 (2025).
32. Bai, S. et al. Qwen2.5-VL technical report. Preprint at https://arxiv.org/abs/2502.13923 (2025).
33. Cheng, W. et al. Optimize weight rounding via signed gradient descent for the quantization of LLMs. in Findings of the Association for Computational Linguistics: EMNLP 2024 11332–11350 (2024).
34. Deng, G. et al. A unified flexible large PSG model for sleep staging and brain disorder diagnosis. medRxiv https://doi.org/10.1101/2024.12.11.24318815 (2024).
35. Qu, W. et al. A residual based attention model for EEG based sleep staging. IEEE J. Biomed. Health Inform. 24, 2833–2843 (2020).
36. Guillot, A. & Thorey, V. RobustSleepNet: transfer learning for automated sleep staging at scale. IEEE Trans. Neural Syst. Rehabil. Eng. 29, 1441–1451 (2021).
37. Jia, Z. et al. SalientSleepNet: multimodal salient wave detection network for sleep staging. in Proc. Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21) 2614–2620 (2021).
38. Phan, H., Andreotti, F., Cooray, N., Chén, O. Y. & De Vos, M. SeqSleepNet: end-to-end hierarchical recurrent neural network for sequence-to-sequence automatic sleep staging. IEEE Trans. Neural Syst. Rehabil. Eng. 27, 400–410 (2019).
39. Wang, J. et al. Generalizable sleep staging via multi-level domain alignment. in Proc. AAAI Conference on Artificial Intelligence 38, 265–273 (2024).
40. Phan, H. et al. XSleepNet: multi-view sequential model for automatic sleep staging. IEEE Trans. Pattern Anal. Mach. Intell. 44, 5903–5915 (2022).
41.
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–783 (2016).
42. O'Reilly, C., Gosselin, N., Carrier, J. & Nielsen, T. Montreal Archive of Sleep Studies: an open-access resource for instrument benchmarking and exploratory research. J. Sleep Res. 23, 628–635 (2014).
43. Hu, E. J. et al. LoRA: low-rank adaptation of large language models. in International Conference on Learning Representations (2022).

Acknowledgements
This work was supported by Brain Science and Brain-like Intelligence Technology – National Science and Technology Major Project (2022ZD0212400, 2021ZD0200404), National Natural Science Foundation of China (82371453), Key R&D Program of Zhejiang (2024C03006, 2024C04024), Zhejiang Key Laboratory of Clinical and Basic Research for Psychiatric Diseases (2024ZY01010, 2024E10107), "Pioneer" and "Leading Goose" R&D Program of Zhejiang (2026C01013), Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China (JYB2025XDXM605), and the Construction Fund of Key Medical Disciplines of Hangzhou (2025HZGF10).

Author Contributions
G.D. conceived the study, designed the methodology, developed the software, conducted all model training and experiments, and wrote the original manuscript. P.W. performed clinical data annotation, contributed to data interpretation, and conducted expert evaluation of reasoning quality. J.W., S.R. and J.X. contributed to methodology design and implementation. W.G. and T.L. supervised clinical annotation, contributed to data interpretation and expert evaluation, and reviewed and edited the manuscript. H.J.
supervised the study, secured funding, provided guidance on study design, and reviewed and edited the manuscript. All authors read and approved the final manuscript.

Competing Interests
The authors declare no competing interests.

Additional information
Supplementary information

Supplementary Information

Supplementary Figure 1 | Additional qualitative examples of model reasoning (set 1)
Supplementary Figure 2 | Additional qualitative examples of model reasoning (set 2)
Supplementary Table 1 | Rating scale for expert evaluation of model-generated sleep staging reasoning
Supplementary Table 2 | Characteristics of datasets used in this study
Supplementary Table 3 | Sleep stage distribution across datasets used in this study
Supplementary Table 4 | Per-class F1 scores for SleepVLM and baseline methods on MASS-SS1 and ZUAMHCS
Supplementary Table 5 | Per-class F1 scores for ablation configurations on MASS-SS1 and ZUAMHCS
Supplementary Table 6 | Expert rating score distributions across model configurations
Supplementary Table 7 | Training, quantization, and inference hyperparameters
Supplementary Note 1 | System prompt for Phase 1 waveform-perceptual pre-training
Supplementary Note 2 | System prompt for Phase 2 supervised fine-tuning
Supplementary References

[Supplementary Figure 1: panel text]

Stage: W, Rules: [W.1, W.2]
Reasoning: EEG channels (F4-M1, C4-M1, O2-M1) display prominent alpha rhythm (8-13 Hz) for ~90% of the epoch. Amplitudes measure 40-50 µV peak-to-peak, exhibiting sinusoidal posterior dominant rhythm. EOG channels (LOC, ROC) exhibit intermittent conjugate vertical blinks (0.5-2 Hz) at 14-20 s, occupying ~20% of the epoch. Chin EMG shows moderate baseline tone with occasional bursts at ~30 µV peak-to-peak amplitude. Slow wave activity (>75 µV), K-complexes, sleep spindles, and rapid eye movements are absent. Alpha dominance and absent sleep features confirm Wakefulness.
Epoch labels: Current: W; Preceding: W; Subsequent: W ✓

Stage: N3, Rules: [N3.1]
Reasoning: EEG channels display prominent slow wave activity meeting criteria for stage N3. F4-M1 exhibits multiple high-amplitude slow waves (>75 µV peak-to-peak) extending beyond boundaries and overlapping into adjacent C4-M1 at multiple intervals (e.g., 0-5 s and 20-25 s). C4-M1 similarly shows slow waves overlapping into O2-M1, confirming >75 µV amplitudes. O2-M1 displays lower-amplitude slow waves consistent with occipital distribution. EOG channels (LOC, ROC) show minimal activity without rapid or slow eye movements. Chin EMG demonstrates very low baseline muscle tone. Slow wave activity occupies ~35% of the epoch in frontal EEG channels (F4-M1, C4-M1), satisfying the ≥20% threshold. Absence of alpha rhythm, K-complexes, sleep spindles, and rapid eye movements rules out stages W, N1, N2, and R. Sustained high-amplitude slow waves meeting amplitude and duration criteria classify the epoch as N3.
Epoch labels: Current: N3; Preceding: N3; Subsequent: N3 ✓

Stage: N1, Rules: [N1.1]
Reasoning: EEG channels (F4-M1, C4-M1, O2-M1) display low-amplitude mixed-frequency (LAMF) activity dominated by 4-7 Hz theta waves. O2-M1 shows complete absence of alpha rhythm (8-13 Hz) observed in the preceding epoch, with background frequencies slowed by ≥1 Hz compared to Wakefulness. EOG channels (LOC, ROC) exhibit conjugate slow eye movements (SEMs) with smooth, sinusoidal deflections (initial durations >500 ms); specifically, a prominent SEM spans 100-150 ms at 10-15 s. Chin EMG maintains moderate baseline tone, indicating reduced but not minimal muscle activity. K-complexes, sleep spindles, and high-amplitude slow waves (>75 µV) are absent in all EEG channels. The preceding epoch showed alpha rhythm in O2-M1 and eye blinks (confirming Wakefulness), while the current epoch demonstrates definitive transition to N1 through complete alpha attenuation and SEMs. Attenuation of alpha rhythm with replacement by LAMF activity for the entire epoch meets criteria for stage N1 (rule N1.1).
Epoch labels: Current: N1; Preceding: W; Subsequent: N2 ✓

Stage: R, Rules: [R.2]
Reasoning: EEG channels (F4-M1, C4-M1, O2-M1) display low-amplitude mixed-frequency (LAMF) activity with predominant 4-7 Hz (theta) frequencies. Alpha rhythm (8-13 Hz) is absent in O2-M1, without K-complexes or sleep spindles. EOG channels (LOC, ROC) show minimal activity without rapid eye movements (REMs) or slow eye movements (SEMs). Chin EMG exhibits very low muscle tone without significant fluctuations. These features demonstrate LAMF EEG activity, low chin EMG tone, and absence of REMs. The preceding epoch shows similar LAMF patterns with a brief EMG spike at 15-20 s, while the subsequent epoch maintains identical LAMF characteristics. This continuity confirms absence of arousal or transition events. LAMF EEG activity, low chin EMG tone, and no REMs classify the epoch as stage R (Rule R.2).
Expert note: The preceding epoch contains a non-arousal-associated K-complex, and the current epoch continues LAMF without an intervening arousal, satisfying the N2 continuation criterion (N2.2). Chin EMG is at the normal N2 level, not at the REM-level low tone required for stage R; AASM notes that N2 EMG may approach stage R levels. No contiguous definite stage R epoch is present, and the preceding-epoch activity described as an "EMG spike" is a conjugate eye movement in EOG channels (perceptual channel confusion). The applicable rule is N2.2, not R.2, yielding a correct stage of N2.
Epoch labels: Current: N2; Preceding: N2; Subsequent: N2 ✗

Supplementary Figure 1 | Additional qualitative examples of model reasoning (set 1). Four additional examples of SleepVLM outputs, complementing Fig. 3 in the main text. Each panel shows three consecutive 30-second PSG epoch images (preceding, current, subsequent) with six channels (F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG), the model's predicted sleep stage, cited AASM rule identifiers, and the complete rationale. Check marks (✓) denote correct classifications and crosses (✗) denote misclassifications against the ground-truth label. Three panels illustrate correct staging: Wakefulness identified via prominent alpha rhythm and eye blinks (Rules W.1, W.2), stage N1 identified via alpha rhythm attenuation and slow eye movements (Rule N1.1), and stage N3 identified via high-amplitude slow wave activity exceeding the ≥20% / >75 µV threshold in frontal channels (Rule N3.1). The third panel shows a misclassification in which the model predicted stage R for a ground-truth N2 epoch. The model cited Rule R.2 based on perceived LAMF activity and low chin EMG tone; however, expert review determined that chin EMG is at the normal N2 level (not at the REM-level low tone required for stage R), noting that chin EMG in stage N2 may be as low as in stage R (AASM Scoring Stage N2, Note 6). The activity described as an EMG change in the preceding epoch is in fact conjugate eye movement in EOG channels (perceptual channel confusion), and no contiguous definite stage R epoch is present, precluding application of Rule R.2. A non-arousal-associated K-complex in the preceding epoch satisfies the N2 continuation criterion (Rule N2.2). PSG channels are color-coded as described in Methods. The complete definitions of all cited AASM rules are provided in Table 4. Rationales have been condensed for brevity without altering the substantive meaning.

[Supplementary Figure 2: panel text]

Stage: W, Rules: [MBM.1, W.1]
Reasoning: EEG channels (F4-M1, C4-M1, O2-M1) show alpha rhythm in O2-M1. These artifacts appear at 10-20 s. Chin EMG maintains high amplitude, indicating sustained high muscle tone. EOG channels (LOC, ROC) show irregular high-amplitude deflections at 10-20 s, attributed to movement artifact rather than physiological eye movements due to non-conjugate nature and correlation with EEG spikes. The preceding and subsequent epochs exhibit Wakefulness features: the subsequent epoch shows clear 10 Hz alpha rhythm in O2-M1 occupying 60% of the epoch (~45 µV) meeting Rule W.1, while both adjacent epochs maintain elevated chin EMG tone. Per Rule MBM.1, when major body movement obscures EEG, the epoch is scored as Wakefulness if both adjacent epochs are Wakefulness. Artifact pattern and high EMG tone are inconsistent with sleep stages, confirming Wakefulness.
Epoch labels: Current: W; Preceding: W; Subsequent: W ✓

Stage: N2, Rules: [MBM.2]
Reasoning: Major body movement is evidenced by extreme amplitude artifact in chin EMG at 5-15 s (>80 µV peak-to-peak) extending into EOG channels. Alpha rhythm is absent; F4-M1 and C4-M1 show no K-complexes or sleep spindles. The preceding epoch shows low-amplitude mixed-frequency EEG without alpha rhythm, while the subsequent epoch contains a clear K-complex in F4-M1 at 0-1 s (negative sharp wave followed by positive wave, ~0.5 s duration) within low-amplitude mixed-frequency activity. Per Rule MBM.2, when alpha rhythm is absent and adjacent epochs are not Wake, a major body movement epoch is scored as the subsequent epoch's stage. The subsequent epoch meets stage N2 criteria (Rule N2.1), classifying the epoch as N2.
Epoch labels: Current: N2; Preceding: N2; Subsequent: N2 ✓

Stage: N1, Rules: [N1.2, R.3]
Reasoning: EEG channels (F4-M1, C4-M1, O2-M1) display low-amplitude mixed-frequency (LAMF) activity dominated by 4-7 Hz theta waves, without alpha rhythm or high-amplitude slow waves. O2-M1 shows continuous theta activity without the 8-13 Hz sinusoidal pattern of alpha rhythm. EOG channels (LOC, ROC) exhibit conjugate slow eye movements (SEMs): specifically, a prominent sinusoidal deflection at 13-15 s lasting ~1000 ms, meeting the >500 ms initial deflection criterion. Chin EMG exhibits moderate amplitude activity (~32 µV peak-to-peak), significantly higher than the near-zero tone characteristic of REM sleep. Absence of rapid eye movements, K-complexes, sleep spindles, and slow wave activity (>75 µV) rules out stages R, N2, and N3. Presence of theta-frequency EEG and SEMs satisfies transition criteria for stage N1 (Rule N1.2) for individuals lacking alpha rhythm. The preceding and subsequent epochs show REM sleep features (REMs and low chin EMG), confirming a transient arousal with SEMs that terminates REM (Rule R.3), necessitating scoring as stage N1.
Epoch labels: Current: N1; Preceding: R; Subsequent: R ✓

Stage: N1, Rules: [N1.2]
Reasoning: EEG channels (F4-M1, C4-M1, O2-M1) display low-amplitude mixed-frequency (LAMF) activity dominated by 4-7 Hz theta waves across frontal, central, and occipital regions. Alpha rhythm (8-13 Hz) is absent in O2-M1, without K-complexes or sleep spindles. EOG channels (LOC, ROC) exhibit conjugate slow eye movements (SEMs) with smooth, sinusoidal deflections (initial durations >500 ms); specifically, a prominent SEM spans 10-15 s (~250 µV peak-to-peak). Chin EMG maintains moderate baseline tone, indicating normal muscle activity. Absence of high-amplitude slow waves (>75 µV) rules out N3, while lack of rapid eye movements and non-low chin EMG tone excludes R. Presence of SEMs and theta-dominant LAMF activity satisfies criteria for stage N1 (Rule N1.2), representing the earliest transition from Wakefulness without alpha rhythm.
Expert note: Posterior dominant alpha rhythm (8–13 Hz) persists in O2-M1 for more than 50% of the epoch; the model misidentified this as LAMF. The EOG deflections at approximately 9–11 s, 20–21 s, and 24–25 s are recurrent eye blinks (0.5–2 Hz) during wakefulness, not slow eye movements. Because this subject generates a posterior dominant rhythm, Rule N1.2, which applies only to non-alpha generators, is inapplicable. The applicable rules are W.1 and W.2, yielding a correct stage of W.
Epoch labels: Current: W; Preceding: W; Subsequent: W ✗

Supplementary Figure 2 | Additional qualitative examples of model reasoning (set 2). Four additional examples of SleepVLM outputs. Layout and notation are as in Supplementary Figure 1. Three panels illustrate correct staging: Wakefulness scored via the major body movement rule, with alpha rhythm identified in the subsequent epoch and both adjacent epochs classified as W (Rules MBM.1, W.1); stage N2 assigned to a major body movement epoch by referencing the subsequent epoch's stage (Rule MBM.2); and stage N1 correctly identified during a REM-to-NREM transition, where the preceding and subsequent epochs show REM sleep features while the current epoch exhibits increased chin EMG tone, LAMF activity, and slow eye movements, confirming a transient arousal that terminates stage R (Rules N1.2, R.3).
The second panel shows a misclassification in which the model predicted stage N1 for a ground-truth W epoch. The model cited Rule N1.2 based on perceived LAMF activity and slow eye movements; however, expert review determined that O2-M1 retains posterior dominant rhythm (alpha rhythm) for >50% of the epoch; the model incorrectly classified this activity as LAMF. The conjugate vertical EOG deflections at approximately 9–11 s, 20–21 s, and 24–25 s are eye blinks (0.5–2 Hz) characteristic of wakefulness, not slow eye movements. Since the individual generates alpha rhythm, Rule N1.2, which applies only to individuals who do not generate posterior dominant rhythm, cannot be invoked. The applicable rules are W.1 and W.2. PSG channels are color-coded as described in Methods. The complete definitions of all cited AASM rules are provided in Table 4. Rationales have been condensed for brevity without altering the substantive meaning.

Supplementary Table 1 | Rating scale for expert evaluation of model-generated sleep staging reasoning

Dimension 1: Factual Accuracy & Perceptual Fidelity
Whether the model's descriptions of waveform existence, frequency, amplitude, and temporal proportion faithfully reflect the rendered PSG image.
5 (Excellent): All waveform descriptions match the image with zero factual errors in existence, frequency, amplitude, or epoch proportion.
4 (Good): Core stage-defining features accurately described; only minor deviations in non-critical background signals.
3 (Acceptable): Waveform existence correctly identified; localized attribute or temporal estimation errors that do not alter the overall staging determination.
2 (Marginal): Significant attribute errors, or key waveform amplitude/frequency estimates severely violate the rendering scales.
1 (Poor): Fabricated waveforms described (absent in image), or the majority of feature descriptions contradict image evidence.
0 (Fail): Descriptions entirely unrelated to the PSG image.

Dimension 2: Diagnostic Evidence Comprehensiveness
Whether the model systematically presents multi-channel evidence (EEG, EOG, EMG), stage-defining features, exclusionary features, and adjacent-epoch context required by AASM rules.
5 (Excellent): All three signal systems covered; stage-defining features exhaustive; exclusionary evidence depth proportionate to staging ambiguity; adjacent-epoch context utilized when applicable.
4 (Good): Core channels and key stage-defining features complete; minor brevity in adjacent-epoch utilization or secondary exclusionary features, not compromising the evidence chain.
3 (Acceptable): Core stage-defining features provided but with one of: a non-key channel omitted, insufficient exclusionary evidence for an ambiguous epoch, or adjacent-epoch context underutilized.
2 (Marginal): Key channel descriptions omitted per AASM requirements, or necessary exclusionary evidence absent in a highly confusable epoch.
1 (Poor): Only single-channel description or conclusory judgment; no substantive multi-channel feature analysis.
0 (Fail): No channel-feature evidence usable for clinical interpretation.

Dimension 3: Logical Coherence & Guideline Concordance
Whether the reasoning chain from observed features to the staging conclusion is rigorous, self-consistent, and adherent to the AASM scoring manual.
5 (Excellent): Rigorous causal logic from evidence to conclusion; cited AASM rules precisely applicable; reasoning forms a complete logical closed-loop at expert level.
4 (Good): Core reasoning correct and evidence sufficiently supports the conclusion; AASM rules applied accurately; minor inefficiencies in secondary steps only.
3 (Acceptable): Correct conclusion with accurate core reasoning; slight logical leaps or loose integration between cited rules and described evidence.
2 (Marginal): Internal contradictions present; described evidence mismatches the cited scoring rules or reasoning direction, failing to support the conclusion.
1 (Poor): Fundamental opposition between the factual evidence presented in the reasoning and the final staging conclusion.
0 (Fail): No coherent reasoning present; no valid logical chain formed.

Each dimension is scored independently on a 0–5 integer scale (maximum total: 15). AASM, American Academy of Sleep Medicine.

Supplementary Table 2 | Characteristics of datasets used in this study

| Dataset | n | Age, y (mean ± SD) | Age range, y | Sex (M / F) | EEG electrodes | Fs (Hz) | Scoring standard | Epoch length (s) | Use in this study |
|---|---|---|---|---|---|---|---|---|---|
| MASS-SS1 | 53 | 63.0 ± 5.3 | 55–76 | 34 / 19 | 17–19 | 256 | AASM | 30 | Held-out test set |
| MASS-SS2 | 19 | 23.6 ± 3.7 | 18–33 | 8 / 11 | 19 | 256 | R&K | 20 | Pre-training |
| MASS-SS3 | 62 | 42.5 ± 18.9 | 20–69 | 29 / 33 | 20 | 256 | AASM | 30 | Training and validation |
| MASS-SS4 | 40 | 25.3 ± 4.3 | 18–35 | 14 / 26 | 4 | 256 | R&K | 20 | Pre-training |
| MASS-SS5 | 26 | 25.0 ± 7.4 | 20–59 | 13 / 13 | 20 | 256 | R&K | 20 | Pre-training |
| ZUAMHCS | 100 | 28.4 ± 10.5 | 18–70 | 42 / 58 | 6 | 512 | AASM | 30 | External test set |

MASS, Montreal Archive of Sleep Studies [1]; ZUAMHCS, Zhejiang University Affiliated Mental Health Center Sleep dataset. Fs, sampling frequency of EEG, EOG, and EMG channels. R&K, Rechtschaffen and Kales; AASM, American Academy of Sleep Medicine.
Supplementary Table 3 | Sleep stage distribution across datasets used in this study

| Dataset | Use in this study | Total epochs | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|---|---|
| MASS-SS1 | Held-out test | 51,165 | 12,152 (23.8%) | 7,098 (13.9%) | 22,152 (43.3%) | 3,402 (6.6%) | 6,361 (12.4%) |
| MASS-SS3 | Training and validation | 59,317 | 6,442 (10.9%) | 4,839 (8.2%) | 29,802 (50.2%) | 7,653 (12.9%) | 10,581 (17.8%) |
| ZUAMHCS | External test | 96,317 | 13,765 (14.3%) | 7,059 (7.3%) | 39,809 (41.3%) | 19,549 (20.3%) | 16,135 (16.8%) |

Supplementary Table 4 | Per-class F1 scores for SleepVLM and baseline methods on MASS-SS1 and ZUAMHCS

MASS-SS1 (n=53)

| Method | Input | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|---|
| AttnSleep | Signal | 0.912 [0.890, 0.929] | 0.514 [0.483, 0.545] | 0.841 [0.824, 0.857] | 0.712 [0.657, 0.757] | 0.801 [0.749, 0.840] |
| DeepSleepNet | Signal | 0.907 [0.883, 0.926] | 0.558 [0.528, 0.586] | 0.840 [0.821, 0.857] | 0.606 [0.522, 0.663] | 0.844 [0.796, 0.883] |
| LPSGM | Signal | 0.913 [0.896, 0.927] | 0.600 [0.576, 0.625] | 0.867 [0.854, 0.880] | 0.715 [0.663, 0.756] | 0.876 [0.846, 0.901] |
| ResnetMHA | Signal | 0.907 [0.887, 0.923] | 0.561 [0.531, 0.589] | 0.807 [0.782, 0.828] | 0.679 [0.595, 0.744] | 0.848 [0.799, 0.885] |
| RobustSleepNet | Signal | 0.903 [0.886, 0.917] | 0.586 [0.557, 0.615] | 0.855 [0.837, 0.871] | 0.589 [0.499, 0.657] | 0.844 [0.815, 0.870] |
| SalientSleepNet | Signal | 0.876 [0.853, 0.897] | 0.477 [0.453, 0.499] | 0.849 [0.834, 0.862] | 0.612 [0.544, 0.666] | 0.797 [0.754, 0.834] |
| SeqSleepNet | Signal | 0.886 [0.864, 0.905] | 0.465 [0.438, 0.494] | 0.851 [0.837, 0.865] | 0.673 [0.609, 0.724] | 0.835 [0.792, 0.871] |
| SleepDG | Signal | 0.888 [0.862, 0.910] | 0.514 [0.482, 0.543] | 0.860 [0.845, 0.874] | 0.701 [0.642, 0.752] | 0.849 [0.818, 0.873] |
| TinySleepNet | Signal | 0.909 [0.891, 0.925] | 0.568 [0.539, 0.597] | 0.873 [0.859, 0.886] | 0.692 [0.632, 0.743] | 0.855 [0.822, 0.880] |
| U-Sleep | Signal | 0.885 [0.862, 0.905] | 0.505 [0.479, 0.531] | 0.862 [0.848, 0.875] | 0.632 [0.570, 0.682] | 0.858 [0.827, 0.882] |
| U-Time | Signal | 0.859 [0.824, 0.887] | 0.440 [0.413, 0.467] | 0.828 [0.812, 0.842] | 0.599 [0.523, 0.661] | 0.787 [0.732, 0.830] |
| XSleepNet | Signal | 0.853 [0.824, 0.879] | 0.430 [0.397, 0.467] | 0.841 [0.820, 0.860] | 0.607 [0.524, 0.677] | 0.825 [0.789, 0.859] |
| SleepXViT | Image | 0.917 [0.904, 0.929] | 0.543 [0.511, 0.572] | 0.877 [0.867, 0.888] | 0.752 [0.696, 0.800] | 0.878 [0.847, 0.900] |
| ResNet-18 | Image | 0.918 [0.901, 0.932] | 0.523 [0.486, 0.559] | 0.873 [0.861, 0.884] | 0.737 [0.686, 0.778] | 0.839 [0.811, 0.863] |
| SleepVLM (ours) | Image | 0.917 [0.904, 0.929] | 0.527 [0.498, 0.555] | 0.872 [0.861, 0.884] | 0.784 [0.722, 0.827] | 0.863 [0.842, 0.880] |

ZUAMHCS (n=100)

| Method | Input | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|---|
| AttnSleep | Signal | 0.754 [0.701, 0.799] | 0.401 [0.373, 0.429] | 0.817 [0.798, 0.834] | 0.809 [0.785, 0.830] | 0.724 [0.694, 0.752] |
| DeepSleepNet | Signal | 0.769 [0.708, 0.818] | 0.443 [0.415, 0.470] | 0.816 [0.797, 0.833] | 0.758 [0.727, 0.783] | 0.720 [0.684, 0.752] |
| LPSGM | Signal | 0.819 [0.761, 0.863] | 0.533 [0.506, 0.560] | 0.859 [0.842, 0.874] | 0.811 [0.785, 0.832] | 0.853 [0.832, 0.871] |
| ResnetMHA | Signal | 0.827 [0.763, 0.869] | 0.451 [0.423, 0.477] | 0.790 [0.769, 0.809] | 0.803 [0.776, 0.826] | 0.688 [0.645, 0.732] |
| RobustSleepNet | Signal | 0.823 [0.790, 0.850] | 0.527 [0.500, 0.553] | 0.836 [0.818, 0.852] | 0.791 [0.758, 0.818] | 0.809 [0.780, 0.836] |
| SalientSleepNet | Signal | 0.789 [0.735, 0.833] | 0.468 [0.439, 0.495] | 0.827 [0.809, 0.844] | 0.738 [0.701, 0.770] | 0.837 [0.811, 0.860] |
| SeqSleepNet | Signal | 0.839 [0.813, 0.860] | 0.435 [0.400, 0.468] | 0.850 [0.832, 0.867] | 0.811 [0.787, 0.831] | 0.850 [0.826, 0.871] |
| SleepDG | Signal | 0.817 [0.754, 0.864] | 0.466 [0.437, 0.492] | 0.833 [0.816, 0.849] | 0.812 [0.784, 0.835] | 0.816 [0.791, 0.839] |
| TinySleepNet | Signal | 0.799 [0.758, 0.834] | 0.448 [0.393, 0.495] | 0.819 [0.801, 0.837] | 0.838 [0.813, 0.860] | 0.800 [0.775, 0.823] |
| U-Sleep | Signal | 0.783 [0.737, 0.821] | 0.407 [0.380, 0.436] | 0.834 [0.818, 0.850] | 0.777 [0.753, 0.799] | 0.797 [0.772, 0.821] |
| U-Time | Signal | 0.728 [0.669, 0.779] | 0.425 [0.401, 0.449] | 0.839 [0.822, 0.853] | 0.818 [0.793, 0.839] | 0.763 [0.728, 0.795] |
| XSleepNet | Signal | 0.694 [0.644, 0.740] | 0.316 [0.281, 0.351] | 0.806 [0.783, 0.826] | 0.789 [0.763, 0.812] | 0.701 [0.649, 0.746] |
| SleepXViT | Image | 0.814 [0.774, 0.850] | 0.428 [0.393, 0.459] | 0.830 [0.812, 0.845] | 0.839 [0.816, 0.858] | 0.711 [0.674, 0.747] |
| ResNet-18 | Image | 0.799 [0.738, 0.850] | 0.487 [0.457, 0.515] | 0.826 [0.808, 0.842] | 0.829 [0.802, 0.852] | 0.795 [0.770, 0.818] |
| SleepVLM (ours) | Image | 0.833 [0.797, 0.861] | 0.486 [0.456, 0.514] | 0.849 [0.835, 0.863] | 0.855 [0.834, 0.873] | 0.806 [0.783, 0.828] |

Classification performance values are point estimates with 95% confidence intervals [lower, upper] derived from 1,000 subject-level cluster bootstrap resamples. The best result in each column within each dataset panel is shown in bold.

Supplementary Table 5 | Per-class F1 scores for ablation configurations on MASS-SS1 and ZUAMHCS

MASS-SS1 (n=53)

| Configuration | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|
| Zero-shot baseline | 0.005 [0.003, 0.007] | 0.077 [0.069, 0.083] | 0.140 [0.131, 0.149] | 0.122 [0.093, 0.152] | 0.005 [0.003, 0.007] |
| SleepVLM | 0.917 [0.904, 0.929] | 0.527 [0.498, 0.555] | 0.872 [0.861, 0.884] | 0.784 [0.722, 0.827] | 0.863 [0.842, 0.880] |
| w/o WPT | 0.910 [0.896, 0.922] | 0.380 [0.338, 0.423] | 0.858 [0.846, 0.869] | 0.766 [0.698, 0.817] | 0.853 [0.829, 0.874] |
| w/o Rule Grounding | 0.902 [0.885, 0.916] | 0.530 [0.501, 0.557] | 0.877 [0.866, 0.887] | 0.765 [0.703, 0.810] | 0.852 [0.826, 0.872] |
| +RFT | 0.892 [0.874, 0.908] | 0.354 [0.312, 0.394] | 0.863 [0.851, 0.874] | 0.772 [0.713, 0.817] | 0.809 [0.775, 0.837] |
| w/o Coarse Annot. | 0.819 [0.783, 0.847] | 0.272 [0.234, 0.308] | 0.838 [0.823, 0.851] | 0.769 [0.716, 0.806] | 0.801 [0.773, 0.826] |
| w/o Coarse Annot. & WPT | 0.578 [0.507, 0.637] | 0.198 [0.177, 0.218] | 0.812 [0.797, 0.826] | 0.709 [0.643, 0.761] | 0.625 [0.574, 0.668] |
| Quantized (W4A16) | 0.912 [0.899, 0.924] | 0.518 [0.490, 0.544] | 0.865 [0.852, 0.877] | 0.786 [0.724, 0.829] | 0.860 [0.840, 0.876] |

ZUAMHCS (n=100)

| Configuration | W | N1 | N2 | N3 | R |
|---|---|---|---|---|---|
| Zero-shot baseline | 0.009 [0.007, 0.011] | 0.069 [0.063, 0.076] | 0.116 [0.112, 0.121] | 0.319 [0.293, 0.344] | 0.005 [0.003, 0.006] |
| SleepVLM | 0.833 [0.797, 0.861] | 0.486 [0.456, 0.514] | 0.849 [0.835, 0.863] | 0.855 [0.834, 0.873] | 0.806 [0.783, 0.828] |
| w/o WPT | 0.824 [0.789, 0.854] | 0.411 [0.379, 0.442] | 0.837 [0.823, 0.851] | 0.848 [0.826, 0.867] | 0.796 [0.775, 0.817] |
| w/o Rule Grounding | 0.828 [0.790, 0.860] | 0.438 [0.404, 0.473] | 0.838 [0.821, 0.854] | 0.846 [0.825, 0.865] | 0.777 [0.751, 0.803] |
| +RFT | 0.802 [0.740, 0.851] | 0.392 [0.357, 0.424] | 0.830 [0.812, 0.846] | 0.840 [0.817, 0.859] | 0.818 [0.793, 0.840] |
| w/o Coarse Annot. | 0.735 [0.693, 0.773] | 0.270 [0.234, 0.305] | 0.825 [0.807, 0.842] | 0.832 [0.809, 0.851] | 0.737 [0.708, 0.765] |
| w/o Coarse Annot. & WPT | 0.406 [0.353, 0.453] | 0.204 [0.177, 0.229] | 0.788 [0.771, 0.805] | 0.812 [0.786, 0.835] | 0.647 [0.611, 0.682] |
| Quantized (W4A16) | 0.827 [0.790, 0.859] | 0.443 [0.412, 0.472] | 0.837 [0.822, 0.852] | 0.854 [0.833, 0.871] | 0.796 [0.772, 0.819] |

Classification performance values are point estimates with 95% confidence intervals [lower, upper] derived from 1,000 subject-level cluster bootstrap resamples. SleepVLM is the proposed full model (shown in bold). WPT, waveform-perceptual pre-training; RFT, rejection sampling fine-tuning; W4A16, 4-bit weight with 16-bit activation post-training quantization (Intel AutoRound). ‘Fine + Coarse’ denotes 5 fine-annotated and 45 coarse-annotated MASS-SS3 subjects; ‘Fine only’ denotes 5 fine-annotated subjects only.
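
The confidence intervals in Supplementary Tables 4 and 5 are described as coming from 1,000 subject-level cluster bootstrap resamples. The following Python sketch illustrates one way such an interval could be computed. It assumes a percentile interval and an F1 score pooled over all epochs of the resampled subjects; neither detail is stated in the source, and the function names and toy data are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single sleep stage label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cluster_bootstrap_ci(subject_ids, y_true, y_pred, label, n_boot=1000, seed=0):
    """Point estimate and 95% percentile CI, resampling whole subjects.

    Assumption: epochs from one subject are kept together, so the
    resampling unit is the subject (cluster), not the epoch.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for sid, t, p in zip(subject_ids, y_true, y_pred):
        by_subject[sid].append((t, p))
    subjects = sorted(by_subject)

    stats = []
    for _ in range(n_boot):
        # Draw as many subjects as in the cohort, with replacement.
        sampled = rng.choices(subjects, k=len(subjects))
        pairs = [pair for sid in sampled for pair in by_subject[sid]]
        true, pred = zip(*pairs)
        stats.append(per_class_f1(true, pred, label))

    # 39 cut points at 2.5% steps; first and last bound the central 95%.
    cuts = statistics.quantiles(stats, n=40)
    point = per_class_f1(y_true, y_pred, label)
    return point, cuts[0], cuts[-1]


# Toy data: three hypothetical subjects, four epochs each, stages as strings.
toy_subjects = ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4
toy_true = ["W", "N2", "N2", "R", "N2", "N2", "N3", "W", "N1", "N2", "R", "N2"]
toy_pred = ["W", "N2", "N1", "R", "N2", "N2", "N3", "W", "N2", "N2", "R", "N2"]
point, lower, upper = cluster_bootstrap_ci(toy_subjects, toy_true, toy_pred, "N2")
print(f"N2 F1 = {point:.3f} [{lower:.3f}, {upper:.3f}]")
```

Resampling subjects rather than epochs preserves the within-night correlation between neighbouring epochs, which is why cluster-level intervals are usually wider than epoch-level ones.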
Supplementary Table 6 | Expert rating score distributions across model configurations

MASS-SS1

| Configuration | Dimension | Mean | Score 0 | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|---|---|---|
| SleepVLM | Factual Accuracy | 4.24 | 1 | 13 | 23 | 28 | 222 | 243 |
| SleepVLM | Evidence Comprehensiveness | 4.14 | 1 | 27 | 18 | 13 | 261 | 210 |
| SleepVLM | Logical Coherence | 4.15 | 1 | 32 | 13 | 14 | 248 | 222 |
| w/o WPT | Factual Accuracy | 3.95 | 1 | 27 | 26 | 47 | 272 | 157 |
| w/o WPT | Evidence Comprehensiveness | 4.07 | 1 | 15 | 26 | 13 | 322 | 153 |
| w/o WPT | Logical Coherence | 4.10 | 1 | 16 | 23 | 17 | 304 | 169 |
| w/o Rule Grounding | Factual Accuracy | 4.02 | 2 | 33 | 16 | 32 | 265 | 182 |
| w/o Rule Grounding | Evidence Comprehensiveness | 4.09 | 2 | 25 | 22 | 13 | 279 | 189 |
| w/o Rule Grounding | Logical Coherence | 4.03 | 2 | 29 | 29 | 26 | 251 | 193 |
| +RFT | Factual Accuracy | 3.88 | 1 | 35 | 21 | 54 | 275 | 144 |
| +RFT | Evidence Comprehensiveness | 3.94 | 1 | 31 | 23 | 17 | 332 | 126 |
| +RFT | Logical Coherence | 3.98 | 1 | 28 | 26 | 16 | 312 | 147 |

ZUAMHCS

| Configuration | Dimension | Mean | Score 0 | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|---|---|---|
| SleepVLM | Factual Accuracy | 4.09 | 0 | 110 | 20 | 40 | 327 | 503 |
| SleepVLM | Evidence Comprehensiveness | 4.01 | 0 | 96 | 30 | 32 | 452 | 390 |
| SleepVLM | Logical Coherence | 4.04 | 0 | 129 | 32 | 20 | 312 | 507 |
| w/o WPT | Factual Accuracy | 3.96 | 0 | 124 | 15 | 56 | 388 | 417 |
| w/o WPT | Evidence Comprehensiveness | 3.98 | 0 | 86 | 37 | 17 | 530 | 330 |
| w/o WPT | Logical Coherence | 4.05 | 0 | 94 | 25 | 22 | 451 | 408 |
| w/o Rule Grounding | Factual Accuracy | 3.85 | 0 | 153 | 20 | 33 | 412 | 382 |
| w/o Rule Grounding | Evidence Comprehensiveness | 3.89 | 0 | 130 | 36 | 19 | 441 | 374 |
| w/o Rule Grounding | Logical Coherence | 3.99 | 0 | 136 | 26 | 20 | 345 | 473 |
| +RFT | Factual Accuracy | 4.01 | 0 | 103 | 19 | 54 | 417 | 407 |
| +RFT | Evidence Comprehensiveness | 3.98 | 0 | 92 | 24 | 14 | 556 | 314 |
| +RFT | Logical Coherence | 4.09 | 0 | 94 | 21 | 19 | 435 | 431 |

Score distributions for each evaluation dimension across four model configurations on MASS-SS1 (530 stratified samples per configuration) and ZUAMHCS (1,000 stratified samples per configuration). Each row reports the number of samples receiving each integer score (0–5) and the mean score. Factual Accuracy, Factual Accuracy & Perceptual Fidelity; Evidence Comprehensiveness, Diagnostic Evidence Comprehensiveness; Logical Coherence, Logical Coherence & Guideline Concordance. The complete rating scale definitions are provided in Supplementary Table 1.
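
The mean scores in Supplementary Table 6 can be recomputed from the per-score counts as a count-weighted average. The short Python sketch below does this for the SleepVLM Factual Accuracy row on MASS-SS1; the helper name is hypothetical and the counts are copied from the table.

```python
def mean_from_counts(counts):
    """Mean rating when counts[s] samples received integer score s."""
    total = sum(counts)
    return sum(score * n for score, n in enumerate(counts)) / total


# SleepVLM, Factual Accuracy, MASS-SS1: counts for scores 0 through 5.
counts = [1, 13, 23, 28, 222, 243]

print(sum(counts))                         # number of rated samples
print(round(mean_from_counts(counts), 2))  # mean score, two decimals
```

The counts sum to the 530 stratified samples stated in the table note, and the weighted mean (2246 / 530) rounds to the reported 4.24.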
Supplementary Table 7 | Training, quantization, and inference hyperparameters

Training

| Hyperparameter | Phase 1 (WPT) | Phase 2 (SFT) | Phase 3 (RFT) |
|---|---|---|---|
| LoRA rank (r) / α / dropout | 16 / 32 / 0.05 | 16 / 32 / 0.05 | 16 / 32 / 0.05 |
| LoRA applied layers | All nn.Linear excluding lm_head | nn.Linear in language model, excluding lm_head | nn.Linear in language model, excluding lm_head |
| Vision encoder | Unfrozen | Frozen | Frozen |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) |
| Learning rate | 10⁻⁴ | 10⁻⁴ | 10⁻⁴ |
| Learning rate schedule | Linear warmup (3%), linear decay | Linear warmup (3%), linear decay | Linear warmup (3%), linear decay |
| Weight decay | 0.1 | 0.1 | 0.1 |
| Gradient clipping (max norm) | 1.0 | 1.0 | 1.0 |
| Epochs | 2 | 15 | 5 |
| Per-device batch size | 4 | 6 | 6 |
| Gradient accumulation steps | 1 | 1 | 1 |
| Effective batch size | 32 | 48 | 48 |

Post-training quantization

| Setting | Value |
|---|---|
| Framework | Intel AutoRound (≥0.9.5) |
| Scheme | W4A16 (4-bit integer weights, 16-bit floating-point activations) |
| Quantized layers | All nn.Linear in 36 language model transformer blocks |
| Layers retained at original precision | Vision encoder, lm_head |
| Group size | 128 |
| Optimization iterations | 200 |
| Calibration samples | 5,000 (stratified by sleep stage from SFT training set) |
| Calibration sequence length | 3,140 tokens |

Inference

| Setting | Value |
|---|---|
| Framework | vLLM |
| Precision | bfloat16 (float16 for quantized model) |
| Temperature | 10⁻⁶ |
| Top-p | 0.8 |
| Maximum new tokens | 1024 |

Training was performed on a single node with 8 NVIDIA A100 GPUs (80 GB each) using bfloat16 mixed precision. WPT, waveform-perceptual pre-training; SFT, rule-grounded supervised fine-tuning; RFT, rejection sampling fine-tuning.

Supplementary Note 1 | System prompt for Phase 1 waveform-perceptual pre-training

The following system prompt was provided to the model during Phase 1 waveform-perceptual pre-training (WPT).
The model received a single 30-second PSG waveform image and was trained to predict per-second spectral and amplitude features for each visible channel. The prompt text is shown verbatim in the shaded box below.

    # Task
    Analyze a 30-second PSG waveform image and estimate the following features at each second:
    EEG/EOG channels: frequency band power (delta, theta, alpha, beta) in dB, plus signal amplitude (MAV) in µV
    EMG channel (Chin): muscle tone (MAV) in µV

    # Image Rendering Parameters
    You must interpret the image according to the following fixed amplitude scales:
    EEG (F4-M1, C4-M1, O2-M1): –50 µV to +50 µV
    EOG (LOC, ROC): –50 µV to +50 µV
    Chin EMG: –40 µV to +40 µV
    Channel-to-color mapping: F4-M1 (yellow), C4-M1 (green), O2-M1 (red), LOC (cyan), ROC (magenta), Chin EMG (blue).
    Channel order: from top to bottom: F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG.
    Note: Some channels may be missing in the image. Only output channels that are visible in the image.

    # Output Format
    {
      "F4-M1": [[d,t,a,b,mav], ...], // 30 arrays, 5 values each
      "C4-M1": [[d,t,a,b,mav], ...],
      "O2-M1": [[d,t,a,b,mav], ...],
      "LOC": [[d,t,a,b,mav], ...],
      "ROC": [[d,t,a,b,mav], ...],
      "Chin": [[mav], ...] // 30 arrays, 1 value each
    }
    EEG/EOG channels: Each contains 30 arrays (seconds 1–30), with 5 values per array: [delta, theta, alpha, beta, mav]. The first 4 values are band powers in dB; the 5th value (mav) is the Mean Absolute Value of signal amplitude in µV. All values use 1 decimal place.
    Chin EMG channel: Contains 30 arrays (seconds 1–30), with 1 value per array: [mav] representing muscle tone via Mean Absolute Value in µV (1 decimal place).

Supplementary Note 2 | System prompt for Phase 2 supervised fine-tuning

The system prompt below was used during Phase 2 rule-grounded supervised fine-tuning.
The fine-grained and coarse annotation training tracks shared an identical prompt, with the sole exception of Sections 4 and 5 (noted in-line). Throughout this Note, shaded boxes contain the verbatim prompt text provided to the model; unshaded text provides editorial context for the reader.

Section 1: Roles and Tasks

    You are an experienced sleep technician. Your task is to analyze an image sequence containing three consecutive 30-second PSG epochs and provide an accurate sleep stage and a detailed, precise, rule-based rationale for the central target epoch.

Section 2A: Image Rendering Parameters

    You must interpret the images according to the following fixed amplitude scales:
    EEG (F4-M1, C4-M1, O2-M1): –50 µV to +50 µV
    EOG (LOC, ROC): –50 µV to +50 µV
    Chin EMG: –40 µV to +40 µV
    Channel-to-color mapping: F4-M1 (yellow), C4-M1 (green), O2-M1 (red), LOC (cyan), ROC (magenta), Chin EMG (blue).
    Channel order: from top to bottom: F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG.
    Key interpretation notes:
    High-amplitude estimation: If the vertical amplitude of a slow wave occupies more than 75% of its channel height, it meets the amplitude criterion for stage N3 (>75 µV peak-to-peak).
    Signal overlap as corroboration: A clear visual indication of extremely high amplitude is when a waveform physically overlaps or extends into the display area of an adjacent channel. You must interpret such overlap as definitive evidence that the amplitude exceeds the channel boundary and satisfies the >75 µV threshold.

Section 2B: AASM Version 3 Sleep Staging Rules

The system prompt included the complete text of all 15 AASM Version 3 adult sleep staging rules, provided verbatim to the model.
This section comprised approximately 1,000 words and covered: general scoring principles (epoch scoring, dominance principle, primary channels); definitions of key waveforms and events (alpha rhythm, theta waves, slow wave activity, eye blinks, REMs, SEMs, K complexes, sleep spindles, low chin EMG tone, sawtooth waves, arousals); and the full scoring criteria for each stage (Rules W.1–W.3, N1.1–N1.2, N2.1–N2.4, N3.1, R.1–R.3, MBM.1–MBM.2). To avoid redundancy, the rule text is not reproduced here; the operationalized version of these rules is presented in Table 4 of the main text.

Section 3: Input Data

    Below, an image sequence containing three consecutive 30-second PSG epochs will be provided, namely:
    the preceding epoch N−1
    the target epoch N (the central image)
    the subsequent epoch N+1
    Your analysis must focus on the central image labeled as the target epoch N, but you may consider the preceding and subsequent epochs to understand dynamic changes.

Section 4: Tasks and Instructions

The fine-grained annotation track includes all five steps below. The coarse annotation track omits Step 4; this difference is marked with a bracket annotation in the prompt box.

    Please strictly follow the steps below:
    1. Analyze the context: Using the preceding (epoch N−1) and subsequent (epoch N+1) epoch images, analyze the dynamics of the target epoch N. Note that this is to understand trends, but your final judgement must be based on observing the target epoch N itself.
    2. Identify key features: In the target epoch N, identify all key waveforms and features in great detail. Your description must include: channel names, occurrence times, waveform types, frequencies, amplitude estimates.
    3. Cite rules: Explicitly identify the specific AASM rule numbers from the knowledge base that support your classification.
    4. Generate the rationale [fine-grained track only]: Integrate your findings into a professional, concise rationale written without first-person pronouns. You must include quantitative or semi-quantitative descriptions of amplitude, frequency, duration, etc.
    5. Format the output: Your final analysis result may only be provided in the following JSON format.

Section 5: Output Format

This is the only section where the two training tracks diverge in their expected output schema. Both JSON templates are shown below.

(a) Fine-grained annotation track

    {
      "reasoning_text": "",
      "applicable_rules": [""],
      "sleep_stage": ""
    }

(b) Coarse annotation track

    {
      "applicable_rules": [""],
      "sleep_stage": ""
    }

The sole difference between the two tracks is that the fine-grained track includes the reasoning_text field containing the model’s complete rationale, whereas the coarse track omits this field and requires only the applicable rule identifiers and the predicted sleep stage. During inference, the model always uses the fine-grained output schema to generate a full rationale, regardless of how it was trained.

Supplementary References

1. O’Reilly, C., Gosselin, N., Carrier, J. & Nielsen, T. Montreal Archive of Sleep Studies: an open-access resource for instrument benchmarking and exploratory research. J. Sleep Res. 23, 628–635 (2014).
