
7/28/2019 12-13_Analiza de regresie și de corelație 1 3122012

Master MSS 2012-2013

Methods of analysis and forecasting for health management

Regression and correlation analysis [1]

Making economic forecasts of the values of the endogenous variable Y as a function of different values of the exogenous variable X presupposes checking and, where appropriate, accepting the hypothesis that the law of dependence between Y and X is correctly specified and identified, and that it is relatively stable and repeatable.

The first aim of regression analysis is to show how a variable Y is related to one or more variables X by means of an equation that makes it possible to predict the dependent variable Y from known values of the independent variables X (x1, x2, ..., xn). In general, regression analysis makes a statistical comparison of past relationships between different factors.

Statistical dependence is a dependence that shows up not between individual elements and phenomena, but between collectivities of phenomena. The measures of association developed in mathematical statistics make it possible to detect and rank the statistical dependences that appear between historical phenomena and processes. Measures of statistical association thus open the way to discovering the statistical regularities specific to those conditioning relationships between historical phenomena and processes that have quantifiable statistical characteristics.

Table 1. Regression analysis and correlation analysis

Regression analysis: a class of methods by which, using a regression equation determined from experimental data, the values of given variables can be estimated (predicted), assuming that the values of other variables are known or forecast.

Correlation analysis: aims to evaluate the degree of interdependence (association) between the variables considered in a regression model, in particular between the dependent variable and the independent ones (an objective achieved by estimating the correlation coefficients and the coefficient of determination).

The stochastic nature of the regression model means that the value of Y cannot be predicted exactly; the uncertainty arises from the random quantity e (the error). The probability distribution of Y and its characteristics are determined by the values of e and by its probability distribution. The assumptions under which regression methods apply are:

- the variables Y and X are not affected by measurement errors. The law of dependence of y_i is conditioned on the realized values x1, x2, ..., xn of the exogenous variable X;
- the random (residual) variable has mean 0 and its variance is independent of X (the homoscedasticity [2] assumption; the relationship between Y and X is taken to be relatively stable);
- the values of the residual variable are not autocorrelated (they do not depend on one another);
- the probability law of the residual variable is the normal law with mean 0 and standard deviation S_{y/x}.

If these assumptions hold, the least squares method yields maximum likelihood estimators. Meeting these assumptions also allows statistical tests to be applied for:

a. checking the significance of the estimators of the regression function (by applying statistical tests [3]);

b. checking the plausibility of the fitted model;

c. producing forecasts on the basis of a confidence interval.

[1] The equation used to draw the best-fit straight line is called a regression equation and was first used by Sir Francis Galton (1822-1911) to show that when tall or short couples have children their heights tend to regress, or revert to the mean height of their parents.

[2] Homoscedasticity is a property of the variance of the disturbance term in a regression equation: this variance stays constant over all observed cases (a condition required for the least squares estimator to be the best linear estimator).

In general, forecasts based on regression analysis refer to:

- conditional mean values of the dependent variables (conditioned on given or forecast values of the independent variables);
- individual values of the dependent variable Y.

Both kinds of forecast are obtained from the regression equation determined from the experimental data. The numerical values obtained are the same; the difference lies in the meaning of these values and in the precision of the resulting estimates. When an individual value of the dependent variable is estimated, the precision is lower than when a conditional mean value of that variable is estimated.

Forecast errors are evaluated using confidence-interval estimates; such an estimate is better the shorter the interval and the closer the confidence level is to 1.

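Statistical interpretation of regression results

Data required for the linear model $Y = a + b \cdot x$: data series

- for the explanatory/independent/exogenous variable: $x_1, x_2, \dots, x_n$
- for the explained/dependent/endogenous variable: $y_1, y_2, \dots, y_n$

Computing the coefficient a:

$$a = \bar{y} - b\,\bar{x}$$

Computing the coefficient b:

$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
\quad\text{or}\quad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Computing the fitted values: $\hat{Y}_i = a + b \cdot x_i$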
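The formulas for a and b can be checked numerically. The following is a minimal Python sketch; the function name `fit_line` and the five data points are illustrative assumptions, not part of the course material.

```python
def fit_line(x, y):
    """Return (a, b) of the least-squares line Y = a + b*x (sum-based formula)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # a = mean(y) - b * mean(x)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]  # fitted values Y_i = a + b*x_i
print(a, b)
```

For these points the slope works out to about 1.97 and the intercept to about 0.09.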

In general, a confidence interval with confidence level $\gamma$, $\gamma \in (0,1)$, for a numerical characteristic $\theta$ of a random variable is an interval of real numbers of the form

$$(\hat{\theta} - t\,\sigma,\ \hat{\theta} + t\,\sigma)$$

where:

- $\hat{\theta}$ is an estimate of the characteristic of interest,
- $\sigma$ is a measure of the spread of the possible estimates,
- t is taken from the tables of the usual probability distributions.

The endpoints $\hat{\theta} \pm t\sigma$ of a confidence interval with confidence level $\gamma$ are set so that there is a $100\gamma\%$ chance that the estimate $\hat{\theta}$ deviates by at most $t\sigma$ from the true value of the characteristic (equivalently, there is a $100(1-\gamma)\%$ chance of making an error larger than $t\sigma$). For this reason the confidence level is chosen close to 1 (usually 0.95 or 0.99), which is the same as saying that the difference $1-\gamma$ (also called the significance threshold or significance level) is close to zero.

[3] A statistical test is a quantity computed in order to test hypotheses. Under the null hypothesis H0 this statistic follows a probability distribution that it would not follow under the alternative hypothesis. The further the value of the test statistic lies beyond the critical values of the distribution, the less plausible it is that the null hypothesis is true.

Correlation analysis aims at:

- measuring the degree of interdependence between the dependent variable Y and the independent variables X_i, an interdependence explained by the regression equation used;
- assessing the degree of association between the independent variables, when the regression equation contains at least two independent variables X_i. This shows to what extent two variables are related to each other. The strength of the relationship is expressed by two indicators:

- the correlation coefficient (R) measures the strength of the linear dependence by a numerical value between -1 and 1:

$$R = \frac{\sum x_k y_k - n\,\bar{x}\,\bar{y}}{\sqrt{\left(\sum x_k^2 - n\,\bar{x}^2\right)\left(\sum y_k^2 - n\,\bar{y}^2\right)}}$$

  o If R = 0 there is no linear correlation between Y and X (but other kinds of dependence may exist, for example nonlinear).
  o If R > 0 and close to 1, increases in the factor X are accompanied by increases in the variable Y.
  o If R < 0 and close to -1, increases in the factor X are accompanied by decreases in Y.

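- the coefficient of determination (R²), which measures the relative reduction in the variation of Y that can be attributed to knowing the factors X_i and the relationship Y = f(X):

$$R^2 = \frac{\Delta^2_{exp}}{\Delta^2_{tot}} = 1 - \frac{n-2}{n-1}\cdot\frac{S^2_{y/x}}{S^2_{y}}$$

where $\Delta^2_{exp}$ and $\Delta^2_{tot}$ are the explained and total sums of squared deviations, $S_{y/x}$ is the residual standard deviation and $S_y$ is the standard deviation of Y.

For example, a value R² = 0.76 indicates that about 76% of the total variation of the variable Y can be explained by the independent variables X included in the model (a value of 0.8 is considered acceptable).

Example:

=CORREL(x values, y values)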

The adjusted coefficient of determination $\bar{R}^2$ is used when the number of observations is close to the number of estimated coefficients (with as many observations as coefficients every observed point would lie on the regression function, so the sample must be large enough for the regression coefficients to be estimated):

$$\bar{R}^2 = R^2 - \frac{k-1}{n-k}\,(1 - R^2)$$

where:

- n is the number of actual observations,
- k is the number of regression coefficients.

In multiple regression, $R^2$ or $\bar{R}^2$ measures the combined effect of the whole set of independent variables on the dependent variable.

Example:

=RSQ(y values, x values)
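As a cross-check of what =CORREL and =RSQ return, the correlation coefficient, R² and the adjusted R² can be computed directly from the formulas above. This is a sketch in Python; the helper names and the data are assumptions made for illustration.

```python
import math


def correlation(x, y):
    """R = (Sxy - n*mx*my) / sqrt((Sxx - n*mx^2) * (Syy - n*my^2))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my
    sxx = sum(xi * xi for xi in x) - n * mx * mx
    syy = sum(yi * yi for yi in y) - n * my * my
    return sxy / math.sqrt(sxx * syy)


def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k regression coefficients."""
    return r2 - (k - 1) / (n - k) * (1 - r2)


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = correlation(x, y)
r2 = r ** 2                          # what =RSQ(y; x) reports for one regressor
r2_adj = adjusted_r2(r2, n=len(x), k=2)
print(r, r2, r2_adj)
```

With only five observations the adjusted value is visibly smaller than R², which is the penalty for estimating two coefficients from a small sample.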

Statistical significance of the model parameters

The t (Student) distribution [4] is used in hypothesis tests on small samples in which the variance of the variable has to be estimated from the data. It is a bell-shaped probability distribution with mean zero, whose dispersion around the mean depends on the degrees of freedom dictated by the sample size. Degrees of freedom give the number of pieces of information that can vary independently of one another; a sample of n observations is said to have n degrees of freedom. For example, computing a simple sample mean costs one degree of freedom, because independent variations in n-1 of the observations require a compensating change in the n-th observation if the sample mean is to stay the same. Likewise, computing the values of k parameters in an econometric model costs k degrees of freedom, leaving (n-k).

If the errors are normally distributed, about 68% of the values of y are expected to lie within $\sigma_e$ (the standard error of forecast) units of the mean value, about 95% within $2\sigma_e$, and over 99% within $3\sigma_e$.

[4] The t test is the test most often used in quantitative economic analysis; the t statistic is defined as the ratio between a standard normal variable and the square root of a χ² variable divided by its number of degrees of freedom.

Each estimated parameter has a standard error, because the parameters are determined from a sample of data; another sample would probably give different values for the model parameters.

The approximate value of the t statistic used to check the significance of the model coefficients is computed as:

$$t = \frac{\text{estimated coefficient} - \text{hypothesized value of the coefficient}}{\text{estimated standard error of the coefficient}}$$

Any coefficient for which $|t_{calc}| < 2.0$ is excluded from the model. Any coefficient for which $|t| > 2.0$ is different from zero at a significance level of approximately 5%. Keeping in the model coefficients whose t statistics are substantially smaller than 2.0 in absolute value increases the number of model parameters and reduces the precision of prediction.
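The rule of thumb |t| > 2.0 can be illustrated for the slope of a simple regression. In this Python sketch the standard error of the slope is taken as the residual standard deviation divided by the square root of Σ(x_i − x̄)², the expression used for the variance of b in the statistical annex; the function name and the data are hypothetical.

```python
import math


def slope_t_statistic(x, y, hypothesized=0.0):
    """t = (b - hypothesized value) / standard error of b, simple regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    se_b = s / math.sqrt(sxx)
    return (b - hypothesized) / se_b


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

t = slope_t_statistic(x, y)
keep_in_model = abs(t) > 2.0   # rule of thumb from the text
print(t, keep_in_model)
```

Here t is far above 2, so the slope would be kept; a coefficient with |t| well below 2 would be a candidate for removal.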

Table 1. Interpretation of p values

Here r is the correlation coefficient between X and Y; SD_Y and SD_X are the standard deviations of the variables Y and X.

To assess the significance of the estimators:

- for a data set of size $n \le 30$, the t (Student) test with n-2 degrees of freedom [6] is applied;
- for $n > 30$, the z test [7] of the normal distribution [8] is applied,

formulating the hypotheses:

H0: a = 0 and b = 0

Ha: a ≠ 0 and b ≠ 0

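If

$$t_{calc}(a) = \frac{|a|}{\sigma_a} > t_{tab} \quad\text{and}\quad t_{calc}(b) = \frac{|b|}{\sigma_b} > t_{tab},$$

where $\sigma_a$ and $\sigma_b$ are the standard errors of a and b, then hypothesis H0 is rejected and a and b are judged to be statistically significant.

Rule: if abs(t_calculated) > t_tabular, then H0 is rejected.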

Note: the tabulated values of t are obtained with the function

=TINV(probability, degrees of freedom)

Example:

=TINV(0.05;15)   t_tab (95%) = 2,131449536

or

=TINV(0.10;15)   t_tab (90%) ≈ 1,753

The t distribution describes a family of distributions that depend on the sample size. For a sample containing more than 30 measurements the t distribution becomes practically identical to a normal distribution, so for large samples either distribution (z or t) can be used to compute the confidence interval.

Confidence interval for the mean of samples with a small number of measurements (normally or approximately normally distributed):

Upper confidence limit = mean + t value * standard deviation / square root of n.

Lower confidence limit = mean - t value * standard deviation / square root of n.
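The two limits can be computed in a few lines. The sketch below is in Python and takes the t value as an input, using 2.131449536, the value that the notes report for =TINV(0.05;15); the sample of 16 measurements (15 degrees of freedom, following df = n-1) is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(values, t_value):
    """Return (lower, upper) = mean -/+ t * sd / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)        # sample standard deviation (n - 1)
    half_width = t_value * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical sample of n = 16 measurements, i.e. 15 degrees of freedom.
sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.7,
          5.0, 5.4, 4.9, 5.1, 5.0, 5.2, 4.8, 5.1]

t_tab = 2.131449536                      # =TINV(0.05;15) as quoted in the text
lower, upper = mean_confidence_interval(sample, t_tab)
print(lower, upper)
```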

[6] In the table of the t distribution the values are grouped by significance level alpha and by degrees of freedom (df). To find the value of t we must use the table of the t distribution and know the significance level and the degrees of freedom. The number of degrees of freedom (df) is computed as df = n-1, where n is the number of values in the data set.

[7] The =NORMSINV function in EXCEL is used.

[8] The central limit theorem states that the sum (and the mean) of a set of random variables follows a normal distribution if the sample is large enough, whatever the shape of the distribution from which the individual variable comes. The theorem is often used to justify the normality assumption for the error term in econometric studies, which allows the t statistic to be used for testing hypotheses, since the error term is assumed to collect the sum of a set of unknown (omitted) random factors.

The value t = 1.96 corresponds to a probability of 0.025 in the right-hand tail alone, or to a probability of 0.05 when both tails are taken together. Example:

=TINV(0.05;10000)

t_tab (95%) = 1,960201185

Note: EXCEL can also give the probability p associated with the computed value of t.

In this case, p = TDIST(ABS(t_calculated), degrees of freedom, 2).
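For large samples the same critical values can be read from the standard normal distribution, as =NORMSINV does in EXCEL. A small Python sketch using the standard library class `statistics.NormalDist` shows how 1.96 relates to the one-tail and two-tail probabilities; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()                      # standard normal: mean 0, sd 1

alpha = 0.05
two_sided = z.inv_cdf(1 - alpha / 2)  # critical value with alpha split over both tails
one_sided = z.inv_cdf(1 - alpha)      # critical value with alpha in the right tail only
right_tail = 1 - z.cdf(1.96)          # probability beyond 1.96 in the right tail

print(two_sided, one_sided, right_tail)
```

The two-sided value is about 1.96 and the right-tail probability beyond 1.96 is about 0.025, which matches the large-sample value 1,960201185 returned by =TINV(0.05;10000).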

Rule: if p is greater than α (the significance level) [9], hypothesis H0 is not rejected.

Table 2. Interpreting the risk of accepting / rejecting H0

                     Conclusion
Actual situation     Do not reject H0            Reject H0
H0 is true           Correct decision            Type I error (risk α)
H0 is false          Type II error (risk β)      Correct decision

A type I error consists in rejecting the null hypothesis when it should in fact have been accepted:

- a hypothesis that is not true is confirmed/validated;
- impact: wrong conclusions, which can lead to inadequate solutions or decisions.

A type II error results from accepting the null hypothesis when it should in fact be rejected:

- an important effect is ignored or missed;
- as a consequence, two alternatives/options may be treated as identical although in reality they differ.

Checking the validity of the model is based on the principle of analysis of variance.

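Table 3.

Source of variation | Measure of variation | Degree of influence | Degrees of freedom | Corrected variance
Explained by the model | $\Delta^2_{exp} = \sum (\hat{y}_i - \bar{y})^2$ | $\Delta^2_{exp} / \Delta^2_{tot}$ | 1 | $\Delta^2_{exp} / 1$
Residual | $\Delta^2_{rez} = \sum (y_i - \hat{y}_i)^2$ | $\Delta^2_{rez} / \Delta^2_{tot}$ | n-2 | $\Delta^2_{rez} / (n-2)$
Total | $\Delta^2_{tot} = \sum (y_i - \bar{y})^2$ | 1 | n-1 |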

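It can be shown that the ratio of the corrected variances

$$F = \frac{s^2_{exp}}{s^2_{rez}} = \frac{\sum (\hat{y}_i - \bar{y})^2 / 1}{\sum (y_i - \hat{y}_i)^2 / (n-2)}$$

is a random variable with a Fisher-Snedecor distribution.

If $F = s^2_{exp} / s^2_{rez} > F_{\alpha}$ for the corresponding degrees of freedom (k-1 and n-k, where k is the number of estimated coefficients; 1 and n-2 in simple linear regression), then the variation of y is explained by the variation of x.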

The ratio

$$R = \sqrt{R^2} = \sqrt{\frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}}$$

is called the correlation ratio and expresses how faithfully the model reproduces the statistical dependence between Y and X. The statistical significance of R can be tested with the F (Fisher-Snedecor) test.

[9] Usually, α = 0.05.

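If

$$F_{calc} = (n-2)\,\frac{R^2}{1 - R^2} > F_{tab}$$

for 1 and n-2 degrees of freedom, then R is significant (the case of simple linear regression).

In general, for a model with k estimated coefficients:

$$F_{calc} = \frac{\text{explained variation}/(k-1)}{\text{unexplained variation}/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$$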

The value of the F test is used to test the significance of the regression coefficients; the hypothesis tested is that the dependent variable is statistically uncorrelated with the independent variables included in the model.

To determine F_tab, use the function

=FINV(probability, degrees of freedom 1; degrees of freedom 2)

Example:

=FINV(0.05;1;15)   F_tab = 4,543077123

The null hypothesis H0 is stated as follows: the explained variance is equal to the residual variance. The F statistic is computed as the ratio of the two variances and the result is compared with a tabulated critical value F_crit.

- If H0 cannot be rejected, the explained variation accounts for only a small share of the total variation in the regression model. In the limit, if R² = 0 then F = 0. As the value of F grows, the hypothesis that Y is not statistically dependent on the X variables considered becomes easier to reject.
- If F_calc > F_tab the null hypothesis can be rejected (the regression coefficients are statistically significant).
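The comparison of F_calc with F_tab can be sketched in Python as follows. The function name is hypothetical; the inputs reuse numbers quoted in the notes (R² = 0.76 from the earlier example and F_tab = 4,543077123 from =FINV(0.05;1;15)), combined here under the assumption of a simple regression with n = 17 observations, so that the residual degrees of freedom are 15.

```python
def f_statistic(r2, n, k):
    """F_calc = (R^2 / (k - 1)) / ((1 - R^2) / (n - k))."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


r2 = 0.76            # example value of R^2 used in the text
n, k = 17, 2         # assumed: 17 observations, coefficients a and b
f_tab = 4.543077123  # =FINV(0.05;1;15) as quoted in the text

f_calc = f_statistic(r2, n, k)
reject_h0 = f_calc > f_tab
print(f_calc, reject_h0)
```

With k = 2 the expression reduces to (n-2)·R²/(1-R²), the form given for simple linear regression.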

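Statistical annex:

Regression line: $\tilde{y} = a + bx$, where a = M[A]; b = M[B].

To the experimental value $y_i$ there corresponds, on the regression line, the estimated value $M[y] = \tilde{y}$:

$$\tilde{y}_i = a + b x_i$$

The deviations of the actual (but unknown) values $y_i$ from the estimated values $\tilde{y}_i$ (on the regression line) are:

$$y_i - \tilde{y}_i = y_i - (a + b x_i) = y_i - a - b x_i$$

The parameters a and b are determined from the condition that the sum of the squared deviations be a minimum:

$$R = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

To this end the expression for R is differentiated with respect to a and b and the derivatives are set equal to zero:

$$\frac{\partial R}{\partial a} = \frac{\partial}{\partial a} \sum_{i=1}^{n} (y_i - a - b x_i)^2 = 0$$

$$\frac{\partial R}{\partial b} = \frac{\partial}{\partial b} \sum_{i=1}^{n} (y_i - a - b x_i)^2 = 0$$

This leads to the system of equations: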

$$\sum_{i=1}^{n} (y_i - a - b x_i) = 0$$

$$\sum_{i=1}^{n} (y_i - a - b x_i)\,x_i = 0$$

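From the first equation, by substitution:

$$a = \frac{1}{n}\sum_{i=1}^{n} (y_i - b x_i) = \frac{1}{n}\sum_{i=1}^{n} y_i - b\,\frac{1}{n}\sum_{i=1}^{n} x_i = \bar{y} - b\,\bar{x}$$

$$\sum_{i=1}^{n} (y_i - \bar{y} + b\bar{x} - b x_i)\,x_i = \sum_{i=1}^{n} \left[(y_i - \bar{y}) - b\,(x_i - \bar{x})\right] x_i = 0$$

Hence:

$$b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})\,x_i}{\sum_{i=1}^{n} (x_i - \bar{x})\,x_i} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$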

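The parameters are therefore given by:

$$b = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2}$$

$$a = \bar{y} - \bar{x}\,\frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2}$$

$$\tilde{y} = a + bx = \bar{y} + \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2}\,(x - \bar{x})$$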
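The derivation can be verified numerically: with a and b computed from the closed-form expressions, both equations of the system (the sum of the residuals, and the sum of the residuals multiplied by x_i) should vanish up to rounding. A minimal Python sketch with hypothetical data and helper names:

```python
def fit_from_means(x, y):
    """a and b from b = (Sxy - n*mx*my) / (Sxx - n*mx^2), a = my - b*mx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (sxy - n * mean_x * mean_y) / (sxx - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = fit_from_means(x, y)
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]

first_equation = sum(residuals)                                # sum (y_i - a - b x_i)
second_equation = sum(e * xi for e, xi in zip(residuals, x))   # sum (y_i - a - b x_i) x_i
print(first_equation, second_equation)
```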

Confidence intervals for the estimated parameters

The regression method requires no assumption about the distribution law of the random variable y. This random variable has theoretical mean M[y] = A + Bx and a variance that is constant for all values of x and equal to $\sigma^2$ (a value that is generally unknown). If y is normally distributed and the observations are made at random, a confidence interval can be constructed for the parameters of the regression line. The variances of the parameters a and b are given by:

$$\sigma_b^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}$$

$$\sigma_a^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]$$

With the point estimators $\hat{\sigma}_a^2$ and $\hat{\sigma}_b^2$, confidence intervals for A and B can be built as described earlier. Since $\sigma^2$ is generally unknown, the interval can be determined from the residual variance, computed from the deviations of the variable y from the values on the regression line (the estimated values) $\tilde{y}$:

$$s_{y/x}^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2$$

which defines the Student random variable with n-2 degrees of freedom.

The statistic $\dfrac{(n-2)\,s_{y/x}^2}{\sigma^2}$ has a $\chi^2$ distribution.

For a significance level α, the following two-sided confidence intervals are obtained for the true values A and B:

$$B \in \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2} \pm t_{\alpha/2,\,n-2}\,\frac{s_{y/x}}{\sqrt{\sum (x_i - \bar{x})^2}}$$

$$A \in \bar{y} - b\,\bar{x} \pm t_{\alpha/2,\,n-2}\; s_{y/x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$

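Confidence interval for the mean value $\tilde{y}^*$ estimated by regression for a known $x^*$

With the parameters a and b of the regression line determined, for a known (given) value $x^*$ the variable y takes, on average, the value $\tilde{y}^* = a + b x^*$.

The standardized normal random variable $\dfrac{\tilde{y}^* - M[\tilde{y}^*]}{\sigma_{\tilde{y}^*}}$ and the variable $\dfrac{(n-2)\,s_{y/x}^2}{\sigma^2}$, which has a $\chi^2$ distribution with n-2 degrees of freedom, together define a Student variable with n-2 degrees of freedom. In this case, for a significance level α, the two-sided confidence interval is:

$$M[\tilde{y}^*] = A + B x^* \in \tilde{y}^* \pm t_{\alpha/2,\,n-2}\; s_{y/x}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$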
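The interval for the mean response can be sketched in Python as below. The t value is passed in as a parameter; the example uses 2.131449536, the value the notes quote for =TINV(0.05;15), with a hypothetical data set of 17 points so that n-2 = 15. The function name and the data are assumptions made for illustration.

```python
import math


def mean_response_interval(x, y, x_star, t_value):
    """Return (y_star, lower, upper) for the mean of y at x_star."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    # Residual standard deviation s_{y/x} with n - 2 degrees of freedom.
    s = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    y_star = a + b * x_star
    half_width = t_value * s * math.sqrt(1 / n + (x_star - mx) ** 2 / sxx)
    return y_star, y_star - half_width, y_star + half_width


# Hypothetical data: 17 points scattered around y = 1 + 2x.
x = list(range(1, 18))
y = [1 + 2 * xi + (0.5 if xi % 2 else -0.5) for xi in x]

t_tab = 2.131449536                       # two-sided 95%, 15 degrees of freedom
at_mean = mean_response_interval(x, y, 9, t_tab)    # x* equal to the mean of x
at_edge = mean_response_interval(x, y, 17, t_tab)   # x* at the edge of the data
print(at_mean, at_edge)
```

The interval is narrowest at x* = x̄ and widens as x* moves away from the mean, because of the (x* − x̄)² term under the square root.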

Multiple regression method

The dependent variable Y is expressed as depending on the variables X_k, regarded as explanatory factors, for level i of the characteristic:

$$Y_i = a_0 + a_1 x_{1i} + a_2 x_{2i} + \dots + a_n x_{ni}$$

(regression equation in additive form)

or

$$Y_i = a_0\, x_{1i}^{a_1}\, x_{2i}^{a_2} \cdots x_{ni}^{a_n}$$

(in multiplicative form).

The distinction between the two forms is fundamental for the economic interpretation of the regression coefficients:

- in the linear case, a coefficient a_k, k = 1, ..., n, is the slope of the variation of Y with respect to the explanatory variable X_k, that is, the change in Y that follows from a one-unit change in the level of X_k (assuming that all the other factors stay constant);
- in the nonlinear case, a coefficient a_k is the elasticity coefficient of the explained variable Y with respect to the explanatory variable X_k (it shows the percentage change in the outcome variable Y when the factor X_k changes by one percent and all the other factors are constant).
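The elasticity interpretation follows from taking logarithms: the multiplicative form becomes linear in ln x, so the exponent can be estimated as an ordinary slope. The Python sketch below illustrates this for a single explanatory variable; the helper names are hypothetical and the data are generated from assumed values a0 = 3 and a1 = 0.5.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b


def fit_power_law(x, y):
    """Fit Y = a0 * x**a1 through ln Y = ln a0 + a1 * ln x."""
    ln_a0, a1 = fit_line([math.log(xi) for xi in x],
                         [math.log(yi) for yi in y])
    return math.exp(ln_a0), a1


# Hypothetical data generated exactly from Y = 3 * x**0.5.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * xi ** 0.5 for xi in x]

a0, a1 = fit_power_law(x, y)
print(a0, a1)   # a1 is the elasticity of Y with respect to x
```

An elasticity of 0.5 means that a 1% increase in x is associated with roughly a 0.5% increase in Y.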

Logistic regression method

Logistic regression models the relationship between a set of independent variables x_i (categorical or continuous) and a dichotomous (nominal, binary) dependent variable Y. Such a dependent variable typically arises when it represents membership of one of two classes or categories: presence/absence, yes/no, etc. The resulting regression equation, which is of a different type from the other regressions discussed, gives information about: the importance of the variables in distinguishing the classes, and the classification of an observation into a class.

Note that the scatter plot of the values gives no indication of the dependences. In such cases classical linear regression does not provide an adequate model.

We assume that the values of y (a binary variable) are coded 0/1, the value 1 generally expressing the occurrence of a certain event, so that what is sought is an estimate of the probability that the event occurs as a function of the values of the independent variables.

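The case of a single independent variable

The model is:

$$P(y=1 \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

or

$$\ln\left(\frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)}\right) = \beta_0 + \beta_1 x$$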

The quantity on the left-hand side is called the logit (transformation) of the probability P(y=1|x). The meaning of the expression P(y=1|x) is clear: the probability of obtaining the value y=1 conditional on the value x. In other words, it is the probability that observation x is classified into the class y=1, or the probability that the value x is associated with the occurrence of the event y=1.

In what follows P(y=1|x) is denoted by p, following the notation of the binomial probability model (probability of success).

The logit transformation is needed in order to map the probability p from the interval (0,1) onto the interval $(-\infty, +\infty)$, which is required when the parameters are estimated. The model is directly related to the notion of odds, denoted OR in these notes:

$$OR = \frac{p}{1-p}$$

which is the ratio between the probability of success and the probability of failure.

The model can also be written as:

$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$

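To determine the regression coefficients, SOLVER in EXCEL is used, computing:

$$p(x) = \frac{e^{L}}{1 + e^{L}}$$

where L is the logit (the linear expression in x), and, for each observation,

$$\ell(x_i) = p(x_i)^{y_i}\,\bigl(1 - p(x_i)\bigr)^{(1 - y_i)}$$

The logarithm of the likelihood function is then maximized:

$$\max \sum_{i=1}^{n} \ln \ell(x_i)$$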
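The maximization that the notes delegate to SOLVER can be sketched in Python. As an assumption of this sketch, SOLVER is replaced by a few Newton-Raphson steps on the log-likelihood; the coefficient names b0 and b1, the function names, and the eight observations are all hypothetical.

```python
import math


def probability(b0, b1, x):
    """p(x) = e^L / (1 + e^L) with L = b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(b0, b1, xs, ys):
    """Sum over observations of ln( p^y * (1 - p)^(1 - y) )."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = probability(b0, b1, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, iterations=25):
    """Maximize the log-likelihood with Newton-Raphson steps (stand-in for SOLVER)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = probability(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p              # derivative with respect to b0
            g1 += (y - p) * x        # derivative with respect to b1
            h00 += w                 # entries of the (negated) second-derivative matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical 0/1 outcomes that overlap, so the maximum is finite.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1, log_likelihood(b0, b1, xs, ys))
```

The data must not be perfectly separable (all zeros below some x and all ones above it); otherwise the likelihood has no finite maximum and neither this sketch nor SOLVER would converge to a meaningful solution.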


Annex: Methods based on hypothesis testing

At various stages in the analysis of the numerical characteristics of a statistical population, it is often necessary to formulate and test hypotheses about the nature or the values of parameters of the theoretical random variables associated with the characteristics studied. Any assumption about the distribution or the characteristics of the random variable X, formulated on the basis of prior information about X, is called a statistical hypothesis.

On the basis of the available information, the analyst/researcher makes a hypothesis about the characteristic, called the base hypothesis and denoted H0, against which there may be one or more alternative hypotheses Ha. For simplicity we may assume that, against the base hypothesis H0, there is a single alternative hypothesis Ha (if H0 is false, then its alternative Ha is true).

If a statistical hypothesis is to be accepted or rejected depending on the data of one or more samples, the hypothesis is said to be tested; the hypothesis tested is called the base hypothesis or null hypothesis. An alternative hypothesis is a hypothesis that may be true when H0 is false and that could be accepted when the base hypothesis is rejected.

Statistical hypotheses are checked with specific methods called statistical tests.

A statistical test is a method by which, on the basis of sample data, a base hypothesis is either accepted or rejected.

If the null hypothesis H0 has a single alternative Ha and a statistical test leads to the rejection of H0, then Ha is accepted. If the null hypothesis has several alternatives, rejecting the null hypothesis implies accepting one of its alternatives, without specifying which of them is true.

The decision rule by which the null hypothesis is accepted or rejected rests on a test criterion (in general, a conveniently chosen sample function is used). Let H0 be a base statistical hypothesis; a sample function C(x, n) is called a test criterion for H0 if the following conditions hold:

a. the distribution of the random variable C(X, n) depends on whether H0 is true or false;
b. if H0 were true, then C(X, n) has a completely specified distribution.

In general, testing the hypothesis H0 proceeds as follows:

- a set I of real numbers is fixed, usually an interval. The set I is called the rejection region or critical region;
- a sample of size n is drawn from the population studied, giving successively the values x1, x2, ..., xn of the numerical characteristic analysed. If $C(x_1, x_2, \dots, x_n) \notin I$, the null hypothesis H0 is accepted; otherwise H0 is rejected.

When a statistical hypothesis is tested, errors can occur:

- although the base hypothesis H0 is true, it is rejected as a result of the test; this is called a type I error;
- although the hypothesis H0 is false, it is accepted as true; this is called a type II error.

Clearly, when a statistical hypothesis is tested, it is desirable that the danger of committing an error be as small as possible.

The significance level (alpha) of a statistical test is the maximum accepted probability of committing a type I error.

The probability of committing a type II error is called the type II risk and is denoted β. The way in which the test criterion was defined makes it possible to control type I and type II errors.

To control type II errors, the power of the test, 1 - β, is also used in place of the type II risk β; it is defined as the probability of rejecting the null hypothesis when it is false.
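The relationship between α, β and the power 1 - β can be made concrete with a simple case that is not taken from the notes: a one-sided z test of H0: μ = μ0 against Ha: μ = μ0 + δ, with known standard deviation σ and sample size n. Under these assumptions the power is 1 − Φ(z_{1−α} − δ√n/σ). A short Python sketch (function and parameter names are illustrative):

```python
import math
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z test when the true shift is delta."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)          # boundary of the rejection region
    shift = delta * math.sqrt(n) / sigma     # standardized true effect
    beta = z.cdf(critical - shift)           # probability of a type II error
    return 1 - beta


power_small_sample = z_test_power(delta=0.5, sigma=1.0, n=25)
power_large_sample = z_test_power(delta=0.5, sigma=1.0, n=100)
power_no_effect = z_test_power(delta=0.0, sigma=1.0, n=25)   # equals alpha
print(power_small_sample, power_large_sample, power_no_effect)
```

For a fixed α, increasing the sample size lowers β and raises the power; with no true effect the rejection probability falls back to α.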

Annex 2: Functions for applying the regression method in EXCEL

Excel includes several array functions for performing linear regression - LINEST, TREND, FORECAST, SLOPE, and STEYX - and exponential regression - LOGEST and GROWTH. These functions are entered as array formulas and they produce array results. You can use each of these functions with one or several independent variables. The following list provides a definition of the different types of regression:

- Linear regression produces the slope of a line that best fits a single set of data. Based on a year's worth of sales figures, for example, linear regression can tell you the projected sales for March of the following year by giving you the slope and y-intercept (that is, the point where the line crosses the y-axis) of the line that best fits the sales data. By following the line forward in time, you can estimate future sales, if you can safely assume that growth will remain linear.

- Exponential regression produces an exponential curve that best fits a set of data that you suspect does not change linearly with time. For example, a series of measurements of population growth will nearly always be better represented by an exponential curve than by a line.

- Multiple regression is the analysis of more than one set of data, which often produces a more realistic projection. You can perform both linear and exponential multiple regression analyses. For example, suppose you want to project the appropriate price for a house in your area based on square footage, number of bathrooms, lot size, and age. Using a multiple regression formula, you can estimate a price, based on a database of information gathered from existing houses.

    =INTERCEPT(known_y's,known_x's)Known_y's is the dependent set of observations or data.

    Known_x's is the independent set of observations or data.

    RemarksThe arguments should be either numbers or names, arrays, or references that contain numbers.If an array or reference argument contains text, logical values, or empty cells, those values are ignored;

    however, cells with the value zero are included.

    If known_y's and known_x's contain a different number of data points or contain no data points,

    INTERCEPT returns the #N/A error value.

    The SLOPE function returns the slope of the linear regression line. The slope is defined as the vertical distancedivided by the horizontal distance between any two points on the regression line. Its value is the same as the first

    number in the array returned by the LINEST function. In other words, SLOPE calculates the trajectory of the line

    used by the FORECAST and TREND functions to calculate the values of data points. The SLOPE function takes theform =SLOPE(known_y's, known_x's).

    =SLOPE(known_y's,known_x's) where:Known_y's is an array or cell range of numeric dependent data points.

    Known_x's is the set of independent data points.

    Remarks

    The arguments must be either numbers or names, arrays, or references that contain numbers.

    If an array or reference argument contains text, logical values, or empty cells, those values are ignored;

    however, cells with the value zero are included.

    If known_y's and known_x's are empty or have a different number of data points, SLOPE returns the #N/A

    error value.
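
    As a rough illustration outside Excel, the two calculations described above can be sketched in Python. The function name slope_intercept and the sample data are assumptions made for this sketch; it follows the usual least-squares definitions and is not Excel's actual implementation. Raising an exception stands in for the #N/A error value.

```python
def slope_intercept(known_ys, known_xs):
    """Least-squares slope and intercept for a single predictor (hypothetical helper)."""
    if len(known_ys) != len(known_xs) or not known_ys:
        # Excel would return #N/A here; the sketch raises instead.
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 3.
sample_xs = [1.0, 2.0, 3.0, 4.0]
sample_ys = [5.0, 7.0, 9.0, 11.0]
slope, intercept = slope_intercept(sample_ys, sample_xs)
print(slope, intercept)
```

    The argument order (y-values first) mirrors the Excel functions, which is easy to get backwards when porting formulas.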

    LINEST calculates the statistics for a line by using the "least squares" method to calculate a straight line that best

    fits your data, and then returns an array that describes the line. You can also combine LINEST with other functions

    to calculate the statistics for other types of models that are linear in the unknown parameters, including polynomial,

    logarithmic, exponential, and power series. Because this function returns an array of values, it must be entered as an

    array formula.

    The equation for the line is:

    y = mx + b or

    y = m1x1 + m2x2 + ... + b (if there are multiple ranges of x-values)

    where the dependent y-value is a function of the independent x-values. The m-values are coefficients corresponding

    to each x-value, and b is a constant value. Note that y, x, and m can be vectors. The array that LINEST returns is

    {mn,mn-1,...,m1,b}. LINEST can also return additional regression statistics.
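
    To make the multiple-variable equation concrete, here is a minimal Python sketch that fits y = m1x1 + m2x2 + ... + b by solving the normal equations. This is one textbook route to a least-squares fit and is not claimed to be how Excel computes LINEST; the function name linest, the row-per-observation input layout, and the sample data are assumptions of the sketch. The result is returned in the reversed order described above.

```python
def linest(known_ys, known_xs_rows):
    """Fit y = m1*x1 + ... + mn*xn + b; return [mn, ..., m1, b] (hypothetical helper).

    known_xs_rows holds one tuple (x1, ..., xn) per observation.
    """
    # Design matrix with a trailing column of ones for the constant b.
    rows = [list(xs) + [1.0] for xs in known_xs_rows]
    k = len(rows[0])
    # Normal equations (A^T A) c = A^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, known_ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # A singular system (collinear x-variables) is not handled in this sketch.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    coef = [aug[i][k] / aug[i][i] for i in range(k)]  # [m1, ..., mn, b]
    return coef[:-1][::-1] + [coef[-1]]


# Illustrative data generated from y = 2*x1 + 3*x2 + 1.
xs_rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)]
ys = [6, 8, 9, 13, 14]
coefficients = linest(ys, xs_rows)  # expected order: [m2, m1, b]
print(coefficients)
```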

    The LINEST and LOGEST functions return only the coefficients that describe a line or curve, not the fitted

    y-values. The difference between them is that LINEST fits a straight line and LOGEST fits an exponential curve.

    LINEST(known_y's,known_x's,const,stats)

    Known_y's is the set of y-values you already know in the relationship y = mx + b.

    If the array known_y's is in a single column, then each column of known_x's is interpreted as a separate

    variable.

    If the array known_y's is in a single row, then each row of known_x's is interpreted as a separate variable.

    Known_x's is an optional set of x-values that you may already know in the relationship y = mx + b.

    The array known_x's can include one or more sets of variables. If only one variable is used, known_y's and

    known_x's can be ranges of any shape, as long as they have equal dimensions. If more than one variable is

    used, known_y's must be a vector (that is, a range with a height of one row or a width of one column).

    If known_x's is omitted, it is assumed to be the array {1,2,3,...} that is the same size as known_y's.

    Const is a logical value specifying whether to force the constant b to equal 0.

    If const is TRUE or omitted, b is calculated normally.

    If const is FALSE, b is set equal to 0 and the m-values are adjusted to fit y = mx.

    Stats is a logical value specifying whether to return additional regression statistics.

    If stats is TRUE, LINEST returns the additional regression statistics, so the returned array is

    {mn,mn-1,...,m1,b;sen,sen-1,...,se1,seb;r2,sey;F,df;ssreg,ssresid}.

    If stats is FALSE or omitted, LINEST returns only the m-coefficients and the constant b.

    Statistic Description

    se1,se2,...,sen The standard error values for the coefficients m1,m2,...,mn.

    seb The standard error value for the constant b (seb = #N/A when const is FALSE).

    r2 The coefficient of determination. Compares estimated and actual y-values, and ranges in value

    from 0 to 1. If it is 1, there is a perfect correlation in the sample: there is no difference between

    the estimated y-value and the actual y-value. At the other extreme, if the coefficient of

    determination is 0, the regression equation is not helpful in predicting a y-value. For information

    about how r2 is calculated, see "Remarks" later in this topic.

    sey The standard error for the y estimate.

    F The F statistic, or the F-observed value. Use the F statistic to determine whether the observed

    relationship between the dependent and independent variables occurs by chance.

    df The degrees of freedom. Use the degrees of freedom to help you find F-critical values in a

    statistical table. Compare the values you find in the table to the F statistic returned by LINEST to

    determine a confidence level for the model. For information about how df is calculated, see

    "Remarks" later in this topic. Example 4 below shows use of F and df.

    ssreg The regression sum of squares.

    ssresid The residual sum of squares. For information about how ssreg and ssresid are calculated, see

    "Remarks" later in this topic.
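
    The statistics in the table can be illustrated with a short Python sketch for the single-predictor case. The formulas used are the standard ones (r2 = ssreg / (ssreg + ssresid), df = n - 2 when the constant is fitted, F = ssreg / (ssresid / df)); the function name regression_stats and the five data points are assumptions chosen for illustration, not values from the course.

```python
import math


def regression_stats(known_ys, known_xs):
    """Return r2, sey, F, df, ssreg, ssresid for y = m*x + b (hypothetical helper)."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    fitted = [m * x + b for x in known_xs]
    # Residual and regression sums of squares.
    ssresid = sum((y - f) ** 2 for y, f in zip(known_ys, fitted))
    ssreg = sum((f - mean_y) ** 2 for f in fitted)
    df = n - 2  # n minus one slope coefficient minus the constant
    return {
        "r2": ssreg / (ssreg + ssresid),
        "sey": math.sqrt(ssresid / df),
        "F": ssreg / (ssresid / df),  # one predictor, so ssreg is divided by 1
        "df": df,
        "ssreg": ssreg,
        "ssresid": ssresid,
    }


stats = regression_stats([2.0, 4.0, 5.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0])
print(stats)
```

    For this sample the fitted line is y = 0.6x + 2.2, so the sums of squares can be checked by hand.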

    You can use the F statistic to determine whether these results, with such a high r2 value, occurred by chance.

    Assume for the moment that in fact there is no relationship among the variables, but that you have drawn a rare

    sample of 11 office buildings that causes the statistical analysis to demonstrate a strong relationship. The term

    "alpha" is used for the probability of erroneously concluding that there is a relationship.

    F and df in LINEST output can be used to assess the likelihood of a higher F value occurring by chance.

    F can be compared with critical values in published F-distribution tables, or Excel's FDIST can be used to calculate

    the probability of a larger F value occurring by chance.

    The appropriate F distribution has v1 and v2 degrees of freedom. If n is the number of data points and const = TRUE

    or omitted, then v1 = n - df - 1 and v2 = df. (If const = FALSE, then v1 = n - df and v2 = df.)

    Excel's FDIST(F,v1,v2) will return the probability of a higher F value occurring by chance.

    =FDIST(x,degrees_freedom1,degrees_freedom2)

    X is the value at which to evaluate the function.

    Degrees_freedom1 is the numerator degrees of freedom.

    Degrees_freedom2 is the denominator degrees of freedom.

    Remarks

    If any argument is nonnumeric, FDIST returns the #VALUE! error value.

    If x is negative, FDIST returns the #NUM! error value.

    If degrees_freedom1 or degrees_freedom2 is not an integer, it is truncated.

    If degrees_freedom1 < 1 or degrees_freedom1 ≥ 10^10, FDIST returns the #NUM! error value.

    If degrees_freedom2 < 1 or degrees_freedom2 ≥ 10^10, FDIST returns the #NUM! error value.

    FDIST is calculated as FDIST = P(F > x), where F is a random variable that has an F distribution with

    degrees_freedom1 and degrees_freedom2 degrees of freedom.
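
    The right-tail probability P(F > x) can be sketched in Python using the regularized incomplete beta function, evaluated here with a standard continued-fraction expansion. This is an illustrative re-implementation under that assumption, not Excel's code; the names fdist, reg_inc_beta and _betacf are hypothetical, and the upper bound on the degrees of freedom is not checked.

```python
import math


def _betacf(a, b, x, max_iter=200, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def fdist(x, degrees_freedom1, degrees_freedom2):
    """P(F > x) for an F distribution; non-integer degrees of freedom are truncated."""
    d1, d2 = int(degrees_freedom1), int(degrees_freedom2)
    if x < 0 or d1 < 1 or d2 < 1:
        raise ValueError("invalid argument")  # stands in for #NUM!
    return reg_inc_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * x))


print(fdist(3.0, 2, 10))
```

    With 2 numerator degrees of freedom the tail has the closed form (1 + 2x/v2)^(-v2/2), which gives a convenient hand check of the sketch.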

    FINV

    Returns the inverse of the F probability distribution.

    If p = FDIST(x,...), then FINV(p,...) = x.

    The F distribution can be used in an F-test that compares the degree of variability in two data sets. For example, you

    can analyze income distributions in the United States and Canada to determine whether the two countries have a

    similar degree of diversity.

    =FINV(probability,degrees_freedom1,degrees_freedom2)

    Probability is a probability associated with the F cumulative distribution.

    Degrees_freedom1 is the numerator degrees of freedom.

    Degrees_freedom2 is the denominator degrees of freedom.

    Remarks

    If any argument is nonnumeric, FINV returns the #VALUE! error value.

    If probability < 0 or probability > 1, FINV returns the #NUM! error value.

    If degrees_freedom1 or degrees_freedom2 is not an integer, it is truncated.

    If degrees_freedom1 < 1 or degrees_freedom1 ≥ 10^10, FINV returns the #NUM! error value.

    If degrees_freedom2 < 1 or degrees_freedom2 ≥ 10^10, FINV returns the #NUM! error value.

    FINV can be used to return critical values from the F distribution. For example, the output of an ANOVA

    calculation often includes data for the F statistic, F probability, and F critical value at the 0.05 significance level. To

    return the critical value of F, use the significance level as the probability argument to FINV.

    FINV uses an iterative technique for calculating the function. Given a probability value, FINV iterates until the

    result is accurate to within 3x10^-7. If FINV does not converge after 100 iterations, the function returns the #N/A

    error value.

    Example

    FINV(0.01,6,4) equals 15.20675
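
    The iterative idea behind FINV (search for the x whose right-tail probability equals the given value) can be sketched with a simple bisection in Python. To keep the sketch short it is restricted to 2 numerator degrees of freedom, where P(F > x) = (1 + 2x/v2)^(-v2/2) in closed form; that restriction, the function names, and applying the 3x10^-7 tolerance to the width of the search bracket are all assumptions of this sketch, not a description of Excel's algorithm.

```python
def f_right_tail_df1_2(x, df2):
    """P(F > x) for an F distribution with 2 and df2 degrees of freedom."""
    return (1.0 + 2.0 * x / df2) ** (-df2 / 2.0)


def finv_df1_2(probability, df2, tol=3e-7, max_iter=100):
    """Bisection search for x such that P(F > x) = probability (df1 fixed at 2)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")  # stands in for #NUM!
    lo, hi = 0.0, 1.0
    # Widen the bracket until the tail probability at hi drops below the target.
    while f_right_tail_df1_2(hi, df2) > probability:
        hi *= 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if hi - lo < tol:
            return mid
        if f_right_tail_df1_2(mid, df2) > probability:
            lo = mid  # tail still too heavy: the answer lies to the right
        else:
            hi = mid
    raise RuntimeError("no convergence")  # analogous to the #N/A error value


critical_value = finv_df1_2(0.05, 10)
print(critical_value)
```

    Feeding the result back into the tail function should return roughly the original probability, which is the FDIST/FINV round trip stated at the start of this topic.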

    Calculating the t-Statistics

    Another hypothesis test will determine whether each slope coefficient is useful in estimating the assessed value.

    The critical value can also be found using Excel's TINV function.

    =TINV(probability,degrees_freedom)

    Probability is the probability associated with the two-tailed Student's t-distribution.

    Degrees_freedom is the number of degrees of freedom with which to characterize the distribution.

    Remarks

    If either argument is nonnumeric, TINV returns the #VALUE! error value.

    If probability < 0 or if probability > 1, TINV returns the #NUM! error value.

    If degrees_freedom is not an integer, it is truncated.

    If degrees_freedom < 1, TINV returns the #NUM! error value.

    TINV returns that value t, such that P(|X| > t) = probability, where X is a random variable that follows the

    t-distribution and P(|X| > t) = P(X < -t or X > t).

    A one-tailed t-value can be returned by replacing probability with 2*probability. For a probability of 0.05

    and degrees of freedom of 10, the two-tailed value is calculated with TINV(0.05,10), which returns

    2.228139. The one-tailed value for the same probability and degrees of freedom can be calculated with

    TINV(2*0.05,10), which returns 1.812462.

    Note: In some tables, probability is described as (1-p).

    Given a value for probability, TINV seeks that value x such that TDIST(x, degrees_freedom,2) =

    probability. Thus, precision of TINV depends on precision of TDIST. TINV uses an iterative search

    technique. If the search has not converged after 100 iterations, the function returns the #N/A error value.
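
    The search described above can be sketched in Python: compute the two-tailed probability P(|X| > t) and then bisect for the t that matches the requested probability. The two-tailed probability is obtained here by Simpson's rule integration of the t density, which is an assumption made to keep the sketch dependency-free and is not how TDIST is implemented; the function names and tolerances are likewise hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    ln_c = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi))
    return math.exp(ln_c - (df + 1) / 2.0 * math.log(1.0 + x * x / df))


def two_tailed_prob(t, df, steps=2000):
    """P(|X| > t), using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3.0)


def tinv(probability, degrees_freedom, tol=1e-9, max_iter=100):
    """Bisection search for t such that P(|X| > t) = probability."""
    df = int(degrees_freedom)  # non-integer values are truncated
    if not 0.0 < probability <= 1.0 or df < 1:
        raise ValueError("invalid argument")  # stands in for #NUM!
    lo, hi = 0.0, 1.0
    # Widen the bracket; very small probabilities with low df would need
    # a more careful integration than this sketch provides.
    while two_tailed_prob(hi, df) > probability:
        hi *= 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if hi - lo < tol:
            return mid
        if two_tailed_prob(mid, df) > probability:
            lo = mid
        else:
            hi = mid
    raise RuntimeError("no convergence")  # analogous to the #N/A error value


two_tailed = tinv(0.05, 10)
one_tailed = tinv(2 * 0.05, 10)
print(two_tailed, one_tailed)
```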

    The STEYX function calculates the standard error of a regression, a measure of the amount of error accrued in

    predicting a y for each given x. This function takes the form =STEYX(known_y's, known_x's).

    The TREND function

    LINEST returns a mathematical description of the straight line that best fits known data. TREND finds points that

    lie along that line and that fall into the unknown category. You can use the numbers returned by TREND to plot

    a trend line, a straight line that helps make sense of actual data. You can also use TREND to extrapolate, or make

    intelligent guesses about, future data based on the tendencies exhibited by known data. (Be careful. Although you

    can use TREND to plot the straight line that best fits the known data, TREND can't tell you if that line is a good

    predictor of the future. Validation statistics returned by LINEST can help you make that assessment.)

    The TREND function takes the form =TREND(known_y's, known_x's, new_x's, const).

    The first two arguments represent the known values of your dependent and independent variables. As in LINEST,

    the known_y's argument is a single column, a single row, or a rectangular range. The known_x's argument also

    follows the pattern described for LINEST. The third and fourth arguments are optional. If you omit new_x's, the

    TREND function considers new_x's to be identical to known_x's. If you include const, the value of that argument

    must be TRUE or FALSE (or 1 or 0). If const is FALSE, TREND forces b to be 0; if it is TRUE or omitted, b is calculated normally.

    To calculate the trend-line data points that best fit your known data, simply omit the third and fourth arguments from

    this function. The results array will be the same size as the known_x's range. To create these values, we selected the

    range ...... and entered =TREND(....., .....) as an array formula using Ctrl+Shift+Enter.
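
    The behaviour of TREND for a single x-variable can be sketched in Python as follows. The function name trend and the sample data are assumptions for illustration; the const handling follows the LINEST convention (FALSE forces b to 0 and fits y = mx), and omitting new_xs returns the fitted values at the known x-values.

```python
def trend(known_ys, known_xs, new_xs=None, const=True):
    """Fit a least-squares line and evaluate it at new_xs (hypothetical helper)."""
    if new_xs is None:
        new_xs = known_xs  # same convention as omitting new_x's in Excel
    n = len(known_xs)
    if const:
        mean_x = sum(known_xs) / n
        mean_y = sum(known_ys) / n
        sxy = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(known_xs, known_ys))
        sxx = sum((x - mean_x) ** 2 for x in known_xs)
        m = sxy / sxx
        b = mean_y - m * mean_x
    else:
        # Line through the origin: minimise sum((y - m*x)^2) with b fixed at 0.
        m = (sum(x * y for x, y in zip(known_xs, known_ys))
             / sum(x * x for x in known_xs))
        b = 0.0
    return [m * x + b for x in new_xs]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]  # illustrative data on y = 2x + 3
fitted = trend(ys, xs)                       # trend-line points at known x-values
forecast = trend(ys, xs, [5.0, 6.0])         # extrapolated points
through_origin = trend(ys, xs, [5.0], const=False)
print(fitted, forecast, through_origin)
```

    As the text warns, the extrapolated values are only as trustworthy as the assumption that the linear pattern continues.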

    Calculating exponential regression

    Unlike linear regression, which plots values along a straight line, exponential regression describes a curve by

    calculating the array of values needed to plot it. The equation that describes an exponential regression curve is

    y = b * m1^x1 * m2^x2 * ... * mn^xn

    If you have only one independent variable, the equation is

    y = b * m^x

    The LOGEST function

    The LOGEST function works like LINEST, except that you use it to analyze data that is nonlinear, and it returns the

    coordinates of an exponential curve instead of a straight line. LOGEST returns coefficient values for each

    independent variable plus a value for the constant b. This function takes the form

    =LOGEST(known_y's, known_x's, const, stats).

    LOGEST accepts the same arguments as the LINEST function and returns a result array in the same fashion. If you

    set the optional stats argument to TRUE, the function also returns validation statistics.

    Note: The LINEST and LOGEST functions return only the coefficients that describe a line or curve, not the fitted

    y-values. The difference between them is that LINEST fits a straight line and LOGEST fits an exponential curve.

    You must be careful to match the appropriate function to the analysis at hand. The LINEST function might be more

    appropriate for sales projections, and the LOGEST function might be more suited to applications such as statistical

    analyses or population trends.

    =LOGEST(known_y's,known_x's,const,stats)

    Known_y's is the set of y-values you already know in the relationship y = b*m^x.

    If the array known_y's is in a single column, then each column of known_x's is interpreted as a separate variable.

    If the array known_y's is in a single row, then each row of known_x's is interpreted as a separate variable.

    Known_x's is an optional set of x-values that you may already know in the relationship y = b*m^x.

    The array known_x's can include one or more sets of variables. If only one variable is used, known_y's and

    known_x's can be ranges of any shape, as long as they have equal dimensions. If more than one variable is used,

    known_y's must be a range of cells with a height of one row or a width of one column (which is also known as a

    vector).

    If known_x's is omitted, it is assumed to be the array {1,2,3,...} that is the same size as known_y's.

    Const is a logical value specifying whether to force the constant b to equal 1.

    If const is TRUE or omitted, b is calculated normally.

    If const is FALSE, b is set equal to 1, and the m-values are fitted to y = m^x.

    Stats is a logical value specifying whether to return additional regression statistics.

    If stats is TRUE, LOGEST returns the additional regression statistics, so the returned array is

    {mn,mn-1,...,m1,b;sen,sen-1,...,se1,seb;r2,sey;F,df;ssreg,ssresid}.

    If stats is FALSE or omitted, LOGEST returns only the m-coefficients and the constant b.

    For more information about additional regression statistics, see LINEST.

    Remarks

    The more a plot of your data resembles an exponential curve, the better the calculated line will fit your data. Like

    LINEST, LOGEST returns an array of values that describes a relationship among the values, but LINEST fits a

    straight line to your data; LOGEST fits an exponential curve. For more information, see LINEST.

    When you have only one independent x-variable, you can obtain y-intercept (b) values directly by using the

    following formula:

    Y-intercept (b):

    INDEX(LOGEST(known_y's,known_x's),2)
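
    A common way to fit y = b*m^x is to take logarithms, since ln y = ln b + x ln m is linear in x, and then apply ordinary least squares. The Python sketch below follows that approach for one x-variable; it is offered as an illustration of the idea, with the function name logest and the sample data as assumptions, and it does not reproduce Excel's array output.

```python
import math


def logest(known_ys, known_xs):
    """Return (m, b) for y = b * m**x, fitted on the log scale (hypothetical helper)."""
    if any(y <= 0 for y in known_ys):
        raise ValueError("y-values must be positive to take logarithms")
    n = len(known_xs)
    log_ys = [math.log(y) for y in known_ys]
    mean_x = sum(known_xs) / n
    mean_ly = sum(log_ys) / n
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(known_xs, log_ys))
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    ln_m = sxy / sxx                 # slope of the line in log space
    ln_b = mean_ly - ln_m * mean_x   # intercept of the line in log space
    return math.exp(ln_m), math.exp(ln_b)


# Illustrative data generated from y = 3 * 2**x.
base_m, constant_b = logest([3.0, 6.0, 12.0, 24.0], [0.0, 1.0, 2.0, 3.0])
print(base_m, constant_b)
```

    Because the fit minimises squared errors in ln y rather than in y, the resulting curve weights relative errors, which is worth keeping in mind when the y-values span several orders of magnitude.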

    You can use the y = b*m^x equation to predict future values of y, but Microsoft Excel provides the GROWTH

    function to do this for you. For more information, see GROWTH.

    Formulas that return arrays must be entered as array formulas.

    When entering an array constant such as known_x's as an argument, use commas to separate values in the same

    row and semicolons to separate rows. Separator characters may be different depending on your locale setting in

    Regional Settings or Regional Options in Control Panel.

    You should note that the y-values predicted by the regression equation may not be valid if they are outside the

    range of y-values you used to determine the equation.

    The GROWTH function

    Where the LOGEST function returns a mathematical description of the exponential regression curve that best fits a

    set of known data, the GROWTH function finds points that lie along that curve. The GROWTH function works like

    its linear counterpart, TREND, and takes the form =GROWTH(known_y's, known_x's, new_x's, const).

    =GROWTH(known_y's,known_x's,new_x's,const)

    Known_y's is the set of y-values you already know in the relationship y = b*m^x.

    If the array known_y's is in a single column, then each column of known_x's is interpreted as a separate variable.

    If the array known_y's is in a single row, then each row of known_x's is interpreted as a separate variable.

    If any of the numbers in known_y's is 0 or negative, GROWTH returns the #NUM! error value.

    Known_x's is an optional set of x-values that you may already know in the relationship y = b*m^x.

    The array known_x's can include one or more sets of variables. If only one variable is used, known_y's and

    known_x's can be ranges of any shape, as long as they have equal dimensions. If more than one variable is used,

    known_y's must be a vector (that is, a range with a height of one row or a width of one column).

    If known_x's is omitted, it is assumed to be the array {1,2,3,...} that is the same size as known_y's.

    New_x's are new x-values for which you want GROWTH to return corresponding y-values.

    New_x's must include a column (or row) for each independent variable, just as known_x's does. So, if known_y's

    is in a single column, known_x's and new_x's must have the same number of columns. If known_y's is in a single

    row, known_x's and new_x's must have the same number of rows.

    If new_x's is omitted, it is assumed to be the same as known_x's.

    If both known_x's and new_x's are omitted, they are assumed to be the array {1,2,3,...} that is the same size as

    known_y's.

    Const is a logical value specifying whether to force the constant b to equal 1.

    If const is TRUE or omitted, b is calculated normally.

    If const is FALSE, b is set equal to 1 and the m-values are adjusted so that y = m^x.

    Remarks

    Formulas that return arrays must be entered as array formulas after selecting the correct number of cells.

    When entering an array constant for an argument such as known_x's, use commas to separate values in the same

    row and semicolons to separate rows.
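
    The argument defaults described for GROWTH can be illustrated with a small Python sketch for one x-variable: known_x's defaults to {1,2,3,...}, new_x's defaults to known_x's, and const = FALSE fixes b at 1. The function name growth and the sample data are assumptions, and the fit is done on ln y, which is a common approach rather than a statement about Excel's internals.

```python
import math


def growth(known_ys, known_xs=None, new_xs=None, const=True):
    """Fit y = b * m**x and evaluate it at new_xs (hypothetical helper)."""
    if any(y <= 0 for y in known_ys):
        raise ValueError("y-values must be positive")  # stands in for #NUM!
    n = len(known_ys)
    if known_xs is None:
        known_xs = list(range(1, n + 1))  # the default array {1,2,3,...}
    if new_xs is None:
        new_xs = known_xs
    log_ys = [math.log(y) for y in known_ys]
    if const:
        mean_x = sum(known_xs) / n
        mean_ly = sum(log_ys) / n
        sxy = sum((x - mean_x) * (ly - mean_ly)
                  for x, ly in zip(known_xs, log_ys))
        sxx = sum((x - mean_x) ** 2 for x in known_xs)
        ln_m = sxy / sxx
        ln_b = mean_ly - ln_m * mean_x
    else:
        # b fixed at 1, so ln y = x * ln m is fitted through the origin.
        ln_m = (sum(x * ly for x, ly in zip(known_xs, log_ys))
                / sum(x * x for x in known_xs))
        ln_b = 0.0
    return [math.exp(ln_b + ln_m * x) for x in new_xs]


projected = growth([3.0, 6.0, 12.0, 24.0], [0, 1, 2, 3], [4, 5])
defaults = growth([2.0, 4.0, 8.0])                       # x-values default to 1, 2, 3
forced = growth([2.0, 4.0, 8.0], [1, 2, 3], [4], const=False)
print(projected, defaults, forced)
```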