survey methods for agriculture farms, and institutions ..._presen… · data. the same method was...

100
United States Department of Agriculture National Agricultural Statistics Service Research Division SRB Research Report Number SRB-93-10 September 1993 - - -~ -- ~ .. "--- ~ ? SURVEY METHODS FOR BUSINESSES, FARMS, AND INSTITUTIONS INTERNATIONAL CONFERENCE ON EST ABLISHMENT SURVEYS Part I NASS Participants

Upload: others

Post on 30-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

United StatesDepartment ofAgriculture

NationalAgriculturalStatisticsService

Research Division

SRB Research ReportNumber SRB-93-10

September 1993

-- -~-- ~ .."--- ~ ?

SURVEY METHODS FOR BUSINESSES,FARMS, AND INSTITUTIONS

INTERNATIONAL CONFERENCE ONEST ABLISHMENT SURVEYS

Part I

NASS Participants

Page 2: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

SURVEY MEmODS FOR BUSINESSES, FARMS, AND INSTITUTIONS by Participants,National Agricultural Statistics Service, U. S. Department of Agriculture, Washington, DC20250-2000, September 1993, Part 1 of 2, Report No. SRB-93-1O.

ABSTRACT

Part 1 of this report is a compilation of all contributed papers presented at the InternationalConference on Establishment Surveys in Buffalo, New York, June 28-30, 1993. They have beenorganized by general subject matter. Several of these will be printed as separate and moredetailed research reports. Part 2 of this report will include monograph and invited papers.

This paper was prepared for limited distribution to the research community outside theU.S. Department of Agriculture. The views expressed herein are not necessarily those ofNASS or USDA.

ACKNOWLEDGMENTS

This report has been the culmination of the efforts of many individuals in NASS. Special thanksgo to Yolanda Grant and Shawna McClain for their many hours of assistance in preparingfor the conference and this report.

1

Page 3: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

l' ABLE OF CONTENTS

1. NEW TECHNOLOGY

Mike BellowApplication of Satellite Data to Crop Area Estimation at the County Level ..... 1

Mitch GrahamState Level Crop Estimation Using Satellite Data in a Regression Estimator ..... 7

Bruce EklundComputer Assisted Personal Interviewing (CAPI) Costs versus Paper and PencilCosts 12

n. OBJECTIVE YIELD SURVEYS

Eric WaldhausHistory and Procedures of Objective Yield Surveys in the United States 17

Tom BirkettYield Models for Corn and Soybeans Based on Survey Data 23

M. Denice McCormickUsing Different Precipitation Terms to Forecast Com and Soybean Yields .... 27

In. NONRESPONSE TO SURVEYS

Matt FetterAn Empirical Investigation of an Alternative Nonresponse Model for Estimation of HogTotals 31

Terry O'ConnorIdentifying and Classifying Reasons for Nonresponse on the 1991 Farm Costs andReturns Survey 37

Kay TurnerAn Evaluation of Nonresponse Adjustment Within Weighting Class Cells for the FarmCosts and Returns Survey 43

Diane WillimackSelected Results of the Incentive Experiment on t.he 1992 Farm Costs and ReturnsSurvey 48

11

Page 4: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

IV. SAMPLING FRAME EVALUATIONS

Jeffrey BaileyTime Related Coverage Errors and the Data Adjustment Factor (DAF) 54

Orrin MusserUnbiased Estimation in the Presence of Frame Duplication 60

Scot RumburgEstimating Agricultural Labor Totals from an Incomplete List Frame 64

Susan Cowles and Susan HicksAn Analysis of the Sampling Frame for the Chemical Use and Farm FinanceSurvey 69

Robin RoarkAgricultural Data Systems - A U.S.lCanada Comparison 73

V. SURVEY ESTIMATION

Robert HoodAn Analysis of Response Bias in the January 1992 Cattle on Feed Reinterview PilotStudy and the July 1992 Cattle on Feed Reinterview Survey 78

Susan Hicks and Matt FetterAn Evaluation of Robust Estimation Techniques for Improving Estimates of TotalHogs 84

John AmrheinAn Analysis of Variance Approach to Defining Imputation Cells for a ComplexAgricultural Survey 89

Cheryl TurnerA Comparison of Four Alternative Weighted Estimators to the Open Estimator for Usein the Agricultural Labor Survey 93

Ismael Flores-Cervantes, Bill Iwig, and Ron BoseckerComposite Estimation for Multiple Stratified Designs from a Single List Frame . 98

VI. DATA PROCESSING

Jenny Shiao and Phil ZellersDistributed Database System: 3 Critical Elements 103

III

Page 5: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

APPLICATION OF SATELLITE DATA TO CROP AREA ESTIMATIONAT THE COUNTY LEVELMichael E. Bellow, USDA/NASS

Research Division, 3251 Old Lee Highway, Room 305, Fairfax, VA 22030KEY WORDS: Battese-Fuller model, county effect.

combined ratio estimator

I. INTRODUCTION

The National Agricultural Statistics Service(NASS) of the United States Department ofAgriculture has published county estimates of cropacreage, crop production. crop yield and livestockinventories since 1917. These estimates assist theagricultural convnunity in local decision makingand are also useful to agribusinesses. The primarysource of data for agricu1tural commodityestimates has always beer surveys of farmers.ranchers and agribuSlnesses who voluntarilyprovide information on a confidential basis.However. surveys des igned and conducted at thenational and state levels a~e often inadequate forproducing reliable informatior at the county orsmall domain level. Therefore, supplementary datasources such as NASS list frdme control data.previous year estimates and Census of Agriculturedata are often used to improve county estimation.Earth resources satellite data represents a usefulancillary data source for county level estimatlonof crop planted and harvested area. The basis forimproved estimation accuracy using satellite datais the fact that. with adequate coverage. all ofthe area within a county can be classified to acrop or ground cover type. The accuracy of theestimates depends upon how accurately thesatellite data are classifIed to each crop.

NASS has used or considered several regressionbased estimators for small area crop acreageestimation with ancillary satellite data. Theseestimators use stratum level counts of pIxelsclassified to crops. From 1976 to 1982. NASS usedthe Huddleston-Ray estimator (Huddleston and Ray,1976). In 1978. the Cardenas famIly of estimators(Cardenas. Bl anchard and Craig. 1978) wasconsidered but not adopted "rom 1982-87. theAgency used the Battese-Fuller estimator (Battese.Harter and Fuller, 1988) for county levelestimation of major crops In the Midwestern grainbe 1t wi th landsa t Mu 1t ispec tra1 Scanner (MSS)data. The same method was used to calculate countyestimates of rice. cotton and soybeans in theMississippi Delta region In 1991-92 with landsatThemat ic Mapper (TM) data. Research has recentlybegun to consider non-regressIon estimators basedon overall (across strata) counts of classifIedpixels. This report discusses two such estimatorsand compares them witr the Battese-Fullerestimator.

1

Graham (1993) provides a description of themethodology used to obtain classified pixel countsand generate state and regional level crop acreageestimates. Some knowledge of those concepts ishelpful 1n the upcoming discussion.

II. BATTESE -FULLER ESTIMATOR

The Battese-Fuller approach to crop areaestimation at the county level is an extension ofthe regressi on methodology used for state levelestimation. The Battese-Fuller estimator (BFE)utilizes the analysis district (multi-county)level regresslOn, but incorporates an additionalterm that accounts for county (random) effects.

The Battese-Fuller model was fIrst developedin the gener.;l framework of linear models withnested error structure (Fuller and Battese. 1973),and later appl ied to the special case of countycrop area estimatIon (Battese. Harter and Fuller.1988) In state level estimation, a group ofcounties and parts of counties covered by one ormore satelllte scenes comprises an analysisdistrict Analysts compute regressionrelatIonshIps between NASS survey reportedacreages and counts of classified pixels. usingarea frame sample units (segments) within eachanalySIS dIstrict. The Battese-Fuller modelassumes that segments grouped by county have thesame slope relationship with classified pixels asthe analYSIS dIstrict, but the intercept term isdifferent. Cne can apply the model within ananalysis d1strict for any land use stratum where avalid regressIon relationship has been found. Theanalyst computes stratum level Battese-Fuller areaestimates for all countIes and subcounties withineach analysIS district. For 1and use strata whereregression is not feasible due to lack of adequatesatellIte coverage or too few segments, a doma1nindirect synt~etic estImator is used.

For a glllen analysis district. the stratawhere regressIon is done are here referred to asregressior strata and the remaining ones assynthetIC strata. For convenience. the regressionstrata are labelled h=l .....Hr and the syntheticstrata h=Hr + i ,... ,H, where Hr is the number ofregression strata and H is the total number ofstrata In tne analysis district. If a given countyis partially contalned in the analys1s district.then the estlmatl0n formulas given below applyonly to the included portion.

For each sample segment WIthIn a gIven stratumh in county c. the Battese-Fuller model specifiesthe following relation:

Page 6: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

where:

nhc = number of sample segments in stratumh, county c

Yhci reported acreage of crop of interest instratum h, county c, sample segment i

xhci number of pixels classified to cropof interest in stratum h, county c,sample segment i

vhc = county (random) effect for stratum h,county c

fhci = random error in stratum h, county c,sample segment i

~Oh' ~lh = analysis district level regressionparameters for stratum h

The county effect and random error are assumedto be independent and normal, with mean zero andvariances a2vh and o2eh' respectively. Therandom errors for segments within the district areassumed to be mutually independent. The countymean residuals are observable and given by:

..uhc. = Yhc. - Pbh - ~hxhc.

where:nhc

(l/nhc) I Yhcii=1

nhc(l/nhc) I Xhcii=1

..Pbh' ~h least squares regression parameter

estimators for stratum h

For a given county, the stratum level meancrop area per population unit (segment) isestimated by:

y(BF),hc.where:

Xhc = mean number of pixels per populationunit classified to crop in stratum h,county c

o < °hc :!: 1

The range of allowed values of the parameter0hc defines a family of Battese-Fullerestimators. If 0hc=O, then the estimate 1ies onthe analysis district regression line for thestratum. The value commonly used is the one thatminimizes the mean square error for stratum h incounty c (Walker and Sigman, 1982):

* 2 2 20hc = nhcovh /(nhcovh +oeh )

2

In general, the vari ance components avh2 and0eh2 are unknown and must be estimated. TheAppendix gives estimators that are a special caseof the unbiased est imators derived by Fuller andBattese (1973). using the "fitting-of-constants"method. They require that a given stratum containat least two sample segments within the county inquestion; otherwise 0hc is set to zero in thecomputation of the Battese-Fuller estimate.

The (unadj usted) stratum 1eve 1 est imator oftotal crop area in county cis:

. --T(uBF),hc = Nhc[Pbh + ~hXhc + 0hcuhc.J

where:

Nhc number of population units in stratum h,county c

The county estlmates are often adjusted to sumto the district totals obtained in state levelregression estimation. The adjusted stratum levelBattese-Fuller estimator is:

C

T(aBF),hc = T(uBF),hc - (Nhc/Nh)L 0hcuhc.c=1

where:

Nh = number of population units in stratum hC = number of counties in analysis district

The adjusted Battese-Fuller estimator of totalcrop area in the regression strata of county cis:

H- r -T(aBF).c = L T(aBF),hc

h=1

Estimation of the variance of the BFE isdescribed by Walker and Sigman (1982). Theirestimator of mean square error, used to derive thevariance estimator, is known to have a downwardbias due to estimation of the variance components.A correctlon due to Prasad and Rao (1990) may beimplemented in the future.

As mentioned previously, synthetic estimationis done in strata where regression is not viable.Since a county usually contains few segments in agiven stratum, the stratum level sample mean cropacreage over the entire analysis district is usedto compute a synthetic estimate. The estimate ofcrop area in synthetic stratum h, county cis:

where:

mean reported crop area per samplesegment in stratum h

Page 7: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

The domain indirect synthetic estimator oftotal crop area in the synthetic strata of countyc is then:

T(SYN),cH

I T(SYN),hc.h=Hr+1

based on sample level information can reduce thebias. Although a pixel count estimator could be afuncti on of counts of pixel s classi fied to manydifferent cover types. this discuss ion wi 11 berestri cted to estimators based on the number ofpixels classified to the crop of interest only. Ageneral expression for such an estimator is:

T (CR)c

where:

The final county estimate is obtained bysumming the regression and synthetic components:

The est imated va riance of the f 1 na 1 countyestimate is computed by summing the varIanceestimates of the regression and syntheticcomponents. The use of the analysis district levelaverage to estimate county totals ignores countyeffects. so the synthetic component of a countyestimate can have a significant bias.

Walker and Sigman (1982) studied the Battese-Fuller model using Landsat MSS data over a sixcounty regi on in eastern Sout h Dakota. At thattime. NASS was using the Huddleston-Ray estimator(Huddleston and Ray, 1976). whIch simply replacedthe analysis district level pixel mean in eachstratum with the county level pixel mean in theregression equation. The county effect parameterof the Battese-Fuller model was highly significantfor corn. the most prevalent in the region of thefour crops considered. The study showed robustnessof the Battese-Fuller family against departurefrom certain model assumptIons, and provided thejustification for replaclng the Huddleston-Rayestimator with the Battese-Fuller estimator foroperational county crop estimation.

III. PIXEL COUNT ESTIMATORS

As improved satell ite sensors enable hIgherclassification accuracy, the overall (acrossstrata) count of pixels WIthIn an area classifiedto a given crop or cover type becomes moreinteresting. The overall pixel count represents acensus of pixels covering the area in question andtherefore is not subject to sampling error.However, there is a nonsampling error due to pIxelmisclassi fication. As a result. the overall pIxelcount (converted to area units) is general1y abiased estimator of crop area. Adjustment factors

3

where:

Xc = number of pixels classified to crop ofinterest in county c

f'J = adjustment term

The adjustment term may be a function of thesample level classification data. The choice ofadjustment term determines the specific estimatorused. If the term is simply set to the area on theground corresponding to one pixel, then the RawPixel Count EstImator (RPCE) is obtained:

T (RPC) = Axc c

where)., IS the conversIon factor (area units perpixel) for the satellite sensor being used.

The RPCE IS bIased 1f the theoreticalcommission error (probability that a pIxelclassified to the crop of interest is from anothercover type) and omission error (probability that apixel from the crop of interest is classified toanother cover type) are not equal. The combinedratlO estimator (CRE). based on the estllnator ofthe same name described In Cochran (1977),attempts to adjust for the bias. This estimator isconceptually SImple. uses stratum levelinformation to compute the adjustment term and hasa readIly avaIlable formula for estimating thevarIance. The CRE can be expressed as follows:

H H:(L NhYh ..)!(I Nhxh ..)]Xch= 1 h= 1

An estimator for the variance of the combinedratio estimator is derived from Cochran'spopulation variance formula, valid for largesamples'

where:

Page 8: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

~h __Sxyh = (l/nh-l)L (xhi-xh ..)(Yhi-Yh ..l

i=lfh = nh/NhYhi = reported area of crop of interest in

stratum h, sample segment ixhi = number of pixels classified to crop of

interest in stratum h, sample segment ixh .. = mean number of pixels per sample

segment classified to crop of interestin stratum h

X total number of pixels classified to cropof interest

IV. EMPIRICAL EVALUATION

This section describes an empirical evaluationof the satellite based county crop area estimatorsdescribed above, performed using data from Iowaand Mississippi. The Iowa data were from a 1988research project, while the Mississippi data werefrom NASS's 1991 operational project in theMississippi Delta region (Bellow and Graham,1992). The quantity estimated was acreage plantedto a crop.

The first application area is a nine countyregion in western Iowa with a high concentrationof corn and soybean s. Ground data from NASS' s1988 June Agricultural Survey (JAS) were used forestimation, with a total sample size of 30segments from two strata. The region was coveredby one TM scene with an image date of July 25,1988. The second area, a twelve county region innorthwestern Mississippi, comprises two contiguouscrop reporting districts that accounted for mostof the state's cotton and rice production in 1991.Ground data from the 1991 JAS were used forestimation, involving 73 segments in four stratafor cotton and 59 segments in two strata for rice.The analysi s used mul titemporal satell ite datawith image dates of Apri I 1 and August 23, 1991.Two TM scenes from each date were needed to coverthe region. For both regions, all seven spectralbands from each scene were utilized. The adjustedversion of the Battese-Fuller estimator wascomputed in all cases.

For Iowa, the analysis used 30 segments, with28 coming from stratum A (agricultural) and theother two from stratum B (agri-urban). Data fromthe segments in stratum A were used for the BFE,which was computed within the subset of thatstratum covered by the TM scene. Parts of Calhoun,Crawford and Ida counties lay outside the TMscene. For the BFE, CRE and RPCE, syntheticest imat ion was appl ied wi thin stratum A for theareas outside the scene. For the BFE, syntheticestimation was used in stratum B for all areas.

4

The strata in Mississippi where Battese-Fullerestimation was used for cotton were strata A (75-100% cultivated), B (51-75%), C (15-50%) and 0 (0-15%). The BFE was applied only in strata A and Bfor rice. Synthetic estimation was used in theother strata for each crop. The TM scenes coveredall areas except for a small part of Yazoo county.

Tables 1 and 2 give the computed values of thesatellite based BFE, CRE and RPCE for Iowa andMississippi, respectively. For comparison, thesurvey based estimate (SYN) obtained by usingsynthetic estimation in all strata is also shown.Estimated standard deviations are given for theSYN, BFE and CRE. The offi cia I county pIantedacreage estimates issued by NASS's Iowa andMississippi State Statistical Offices are alsolisted. These published estimates are based onadditional survey and administrative data. Theofficial county figures for Iowa are believed tobe highly accurate indicators of corn and soybeanacreage. Rice figures are not given for Issaquena,Quitman and Yazoo counties since Mississippi didnot issue official rice estimates for thosecounties in 1991. Tables 3 and 4 give measures ofestimator accuracy for the two states, computedbased on the final official figures. The meandeviation (MO), root mean square deviation (RMSD),mean absolute deviation (MAD) and largest absolutedeviation (LAD) are shown.

Comparing the standard deviations of SyN, BFEand CRE given in Table I, it is seen that CRE hadthe lowest value for both corn and soybeans in allIowa counties considered. BFE had lower variancethan SYN in all counties for corn and all but onecounty for soybeans. Tabl e 2 shows that inMississippi, CRE had lower variance than BFE ineight of twelve counties for cotton and eight ofnine counties for rice. For both cotton and rice,SYN had higher variance than BFE and CRE in eachcounty.

Table 3 shows that for corn in Iowa, BFE hadthe lowest MAD and RMSD among the four estimatorsstudied. However, RPCE had the lowest RMSD and MADfor soybeans. From Table 4, BFE showed the lowestMAD and RMSD for cotton in Mississippi, but CREhad the lowest MAD and RMSD for rice. For all fourcrops, the survey based estimator SYN showed thehighest values of RMSD, MAD and LAD and istherefore clearly inferior to the other threeestimators. The mixed results suggest that therelative performance of the three satellite basedestimators may depend to a large degree on thespecific crop. The mean deviation of BFE wasnegative for all four crops, suggesting a possibledownward bias of this estimator.

V. SUMMARY

This paper described the current status ofsatellite based county crop area estimation in

Page 9: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

NASS. The Battese-Full er model is currentl yapplied to compute county acreage indicationsprovided to certain NASS State StatisticalOffices. Estimators based on overall pixel countshave recently begun to receive attention.Empirical results for Iowa and Mississippi suggestthat the CRE has lower variance than the BFE,while relative performance of estimators appearsto be crop specific. The BFE and CRE both showed anegative bias in the study. Future research willexplore properties of these estimators fordifferent crops and other regions.

REFERENCES

Battese, G.E., Harter, R.M. and Fuller, W.A.,(1988) "An Error-Components Model for Predictionof County Crop Areas Using Survey and SatelliteData," Journal of the AmerIcan StatisticalAssociation, vol. 83, no. 401, pp. 28-36.

Bellow, M.E. and Graham, M.L. (1992) "ImprovedCrop Area EstimatIon in the MissIssippi DeltaRegion using Landsat TM Data," ProceedinQs of theASPRS/ACSM/RT92 Convention, Wasrington, D.C., vol.4, pp. 423-432.

Evaluation o~ the Huddleston-Ray and Battese-Fuller EstImators," SRS Staff Report No. AGES820909, U.S Department of Agriculture.

APPENDIX. ESTIMATION OF BATTESE-FULLER VARIANCECOMPONENTS

The estimators of the Battese-Fuller variancecomponents at the analysis district levelrepresent a speci al case of the more generalunbi ased est i mators deri ved by Full er and Battese(1973). The variance component estimators are asfollows:

where:

Cardenas, M., Blanchard, M.M. and Craig, M.E.,(1978) "On The Development of Small AreaEstimators Using LANDSAT Data as AuxiliaryInformation," Economic. Statistics, andCooperatives Service. U.S Department ofAgriculture.

s 2uh

Cochran, 'J.G., (1977) "Sampll ng Techniques." NewYork, NY: John Wiley & Sons, pp 165-167.

Fuller, W.A. and Battese, G.E. (1973)"Transformations for EstimatIon of Linear Modelswith Nested-Error Structure," Journal of theAmerican Statistical AssoCIation, vol. 68, no.343, pp. 626-632.

Graham, M.L. (1993) "State Level Crop AreaEstimation using Satellite Data in a RegressionEstimator," ProceedinQs of the InternationalConference on Establishment Surveys, Buffalo, NY.

Huddleston, H.F. and Ray. R. (1976) "A NewApproach to Small Area Crop Acreage Estimation,"Annua 1 Meet i nQ of the Ameri can AQri cultura 1Economics Association, State College, PA.

Prasad, N.G.N. and Rao. J.N.K. (199D) "TheEstimation of the Mean Squared Error of Small-AreaEst i mators," Journal of the Amer i can Stat i st 1 ca 1Association, vol. 85, no. 409. pp. 163-171.

Walker, G. and Sigman, R. (1982) "The Use ofLANDSAT for County Est lmate~; of Crop Areas -

5

c C C n\~ 2- 2 (' 2) (' hc 2

nh L.. nhe xhe. + L nhc L L xhci )-Qhc=1 c=l c=l i=1

C

Qh 2nhxr. L nhc2xhC.c=l

The value of the quant i ty 8 hc that mi mmi zesthe mean sqJare error of the Battese-Fullerestimator can then be estimated by:

Walker and SIgman (1982) provide expressIonsfor the mean square error and mean squareconditlona! bIas of the stratum level Battese-Fuller estlmdtor. Separate formulas are requireddependi ng upon whether the regression parametersare known or estimated. Variance estimators arederIved from :hese formulas.

Page 10: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Table 1: County Estimates for Iowa 1988 (1000 Acres)CORN:County Official SYN SO BFE SO CRE SO RPCEAudubon 100.0 11'2."4 6."5 92."2 3."2 93."6 2."1 100.6Calhoun 133.0 144.9 8.3 133.2 3.9 134.4 2.9 144.2Carroll 141.0 146.2 8.4 141.4 4.5 142.1 3.1 152.6Crawford 147.0 183.2 10.6 152.7 4.7 155.1 3.2 164.9Greene 125.0 145.9 8.4 130.0 3.9 132.8 2.9 142.7Guthrie 98.0 151.3 8.7 106.3 5.2 107.8 2.4 115.8Ida 112.0 111.4 6.4 107.0 4.0 107.0 3.8 110.3Sac 136.0 148.1 8.5 138.3 4.0 139.6 3.1 150.0Shelby 155.0 149.4 8.6 140.7 4.0 141.5 3.1 152.1SOYBEANS:County Official SYN SO BFE SO CRE SO RPCEAudubon 70.7 74":"0 7."5 69."9 4."6 7ii":'"4 2."1 74.8Calhoun 150.0 95.4 9.6 145.0 5.8 136.9 4.0 145.2Carro 11 117.0 96.1 9.7 106.7 9.7 106.4 3.1 113.0Crawford 106.0 120.4 12.1 106.9 5.8 108.1 3.1 113.8Greene 143.0 96.1 9.7 117.5 5.4 109.6 3.2 116.3Guthrie 77.5 99.5 10.0 64.4 7.0 78.8 2.3 83.7Ida 75.2 73.3 7.4 76.4 5.3 76.1 4.3 78.2Sac 124.0 97.3 9.8 112.9 5.5 108.8 3.2 115.5Shelby 94.9 98.3 9.9 81.0 6.0 91.1 2.7 96.7

Table 2: County Estimates for Mississippi 1991 (1000 Acres)COTTON:County Official SYN SO BFE SO CRE SO RPCEBolivar 65.5 1'5'6."2 15."4 GT:"6 6."1 60':"6 3."9 80.6Coahoma 105.7 59.2 8.4 88.3 4.2 82.6 5.2 109.8Humphrey 61.6 53.2 7.2 57.3 3.4 54.2 3.4 72 .1Issaquena 38.0 42.6 8.6 34.6 3.9 27.5 1.8 36.6Leflore 79.2 68.8 9.6 87.8 3.5 83.4 5.3 111.0Quitman 31.0 48.1 7.2 46.4 4.0 44.5 2.8 59.3Sharkey 47.0 43.2 6.9 48.6 3.4 42.5 2.7 56.6Sunfl ower 100.0 95.6 15.0 79.3 5.5 73.9 4.7 98.3Tallahatchie 64.2 68.9 10.5 67.9 4.9 60.3 3.8 80.3Tunica 45.6 47.1 6.9 38.0 2.5 36.5 2.3 48.6Washington 95.7 84.4 11.6 102.4 4.0 93.2 5.9 124.1Yazoo 94.5 89.3 23.4 93.9 7.5 81.9 5.2 108.9RICE:County Official SYN SO BFE SO CRE SO RPCEBol ivar 74.0 5'0:"8 1r9 66."2 3."6 66."9 6."1 60.9Coa homa 15.8 20.3 4.7 10.4 2.5 10.7 1.0 9.7Humphreys 3.6 22.8 5.2 7.1 2.3 4.7 0.4 4.3Leflore 16.6 30.7 7.1 19.4 3.6 17.3 1.6 15.8Sharkey 5.0 18.0 4.1 7.8 1.7 6.5 0.6 5.9Sunflower 36.0 51.1 12.0 37.8 3.5 36.7 3.4 33.4Tallahatchie 9.6 20.9 5.1 8.5 3.0 8.1 0.7 7.4Tunica 17.5 17.6 4.3 9.9 2.6 13.0 1.2 11.9Washington 30.5 39.6 9.0 22.6 3.5 28.0 2.6 25.4

Table 3: Iowa Estimator AccuracyCORN SOYBEANS

EST MO RMSO MAD LAD MO RMSO MAD LADBFE -Q."6 6:8 5.4 14.3 -8."6 11.9 9.1 25.5RPCE 9.6 12.6 10.6 17.9 -2.3 10.3 7.4 26.7CRE 0.8 7.4 6.3 13.5 -8.0 13.5 9.0 33.4SYN 16.2 23.8 17.6 53.3 -12.0 28.0 21.6 54.6

Table 4: Mississippi Estimator AccuracyCOTTON RICE

EST MO RMSO MAD LAD MO RMSO MAD LADBFE -r8 10.0 T:8 20:7 -2."1 5:2 4.5 7.9RPCE 13.2 17.2 13.7 31.8 -3.8 5.6 4.1 13.1CRE -7.2 12.5 10.2 26.1 -1.9 3.5 2.7 7.1SYN -1.8 19.4 13.2 46.5 7.0 13.9 12.2 23.2

6

Page 11: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

STATE LEVEL CROP AREA ESTIMA nON USING SATELLITE DATAIN A REGRESSION ESTIMATOR

Mitchell L. Graham, USDAINASS3251 Old Lee Hwy. Rm. 305 Fairfax, Virginia 22030

KEY WORDS: regression estimator, LandsatThematic Mapper, land cover estimate, crop acreageestimate.

ABSTRACTThe USDA's National Agricultural Statistics Service(NASS) estimates state level crop acreage in theMississippi Delta region using area frame survey dataand Landsat Thematic Mapper (TM) satellite data.Five general steps produce these acreage estimates.First, a sample of TM pixel data is clustered by landcover. Second, sampled TM pixels are assigned to aland cover class using maximum likelihoodclassification. Third, classified sample pixels areregressed with reported crop acreages. Fourth, TMscenes are classified. Finally, acreage is estimatedwith a regression estimator using classified pixelcounts as ancillary information to the ground surveydata. The potential benefit is mainly a reduction invariance with some adjustment of the state acreageestimates.

BACKGROUNDThe Mississippi River Delta region is the mostimportant rice producing area in the United States andis also a major cotton producing area. The region,which includes all or part of five states, accounted for76 percent of U.S. planted rice acreage and 29 percentof U.S. planted cotton acreage in 1991. With 1.3million planted acres of rice, Arkansas was the majorDelta rice producing state accounting for 46 percent ofthe 1991 national total. (USDA NASS, 1992). The1992 Arkansas rice estimate was 1.4 million plantedacres; the 1993 estimate was 1.35 million plantedacres (USDA NASS, 1993).

The Delta region provides an ideal setting for remotesensing based estimation techniques. NASS' s currentgeneral purpose area sampling frame is not designedfor crops that are localized in specific areas. Thiscondition can lead to high state level relative samplingerrors for crops such as cotton and rice. In Arkansas,nearly all the rice and cotton occur in the eastern thirdof the state oriented north-to-south along theMississippi River. This geographic orientationcoincides with the ground viewing orientation of polar

7

orbiting Landsat satellites and minimizes the numberof satellite scenes needed to cover Arkansas.

DATA PROCESSINGPEDITOR is used for data processing on a MicroVax3500 computer and on IBM PC compatibles in a DOSenvironment. PEDITOR is a special purpose softwaresystem developed at NASS (Ozga et al., 1992) forcrop area estimation. PEDITOR is mainly written inPASCAL and contains modules for image display andprocessing, as well as estimation. Image display andgraphics modules are run on PCs, while non-graphicsmodules can run on a either a PC or MicroVax.Computationally intensive jobs, such as classificationof multitemporal TM scenes, are processed on a Craysupercomputer (Idaho National Engineering LaboratorySupercomputing Center in Idaho Falls, Idaho).

DATA ACQUISITIONFor the 1991/92 Delta Project, NASS' s RemoteSensing Section (RSS) acquired ground data from theJune Agricultural Survey (JAS) and Landsat data fromEOSAT Corporation. Data acquisition involved theJAS, a recheck visit to JAS segments, spring TMscene selection, and summer 1M scene selection.

The ground sample units were small land areas calledsegments, each about one square mile for strata II, 12,20 and 21. Segments were selected randomly from anarea sampling frame stratified by land use categoriesordered by percent of cultivated land. See Table I.During the June survey, field enumerators interviewedthe land managers in each segment and recorded theland cover (rice, fallow, soybeans, pasture, woods,water, etc.), size, and boundaries for every field.Uncultivated areas within a segment were alsorecorded. At this point, the survey data could be usedto make NASS's usual preliminary crop area estimateshaving measurable precision, but based on ground dataalone. Mid-summer, RSS rechecked segments wherea farmer indicated, during the JAS, that a crop wouldbe planted later.

Using knowledge of cropping practices, analystsselected Landsat TM scene dates to facilitate cropdiscrimination within the constraints imposed by cloud

Page 12: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

The TM scene acquisition dates and data quality affectthe organization of both analysis and estimation. Tocontrol atmospheric and phenological factors, areas ofArkansas viewed by Landsat on different dates areanalyzed and processed separately. The Landsat 5

Table 2: Landsat TM Scene Overpass Dates for1991 and 1992 Arkansas Analysis Regions.Analysis Multi- _Overpass Date_Region temporal Pass 1. Pass 2.1991

Eastern yes 4/01/91 8/23/91Central no 8/14/91

1992Northeast yes 5/05/92 7/24/92Southeast yes 5/05/92 6/22/92Central yes 4/26/92 8/16/92

cover and scene availability. TM data consists ofseven spectral measurements on each of 41.6 millionpicture elements (pixels) arranged in a 5965 by 6967array called a scene. When possible, spring andsummer Landsat TM scenes from the same area werecombined to create a single multi temporal, 14dimensional, satellite data set. Each Landsat scenewas reformatted and registered to 1:250,000 USGSmaps. Then sampled segments were digitized andlocated within each Landsat scene. When thegeographic correspondence between TM pixel data andJAS segments was established, the Landsat TM datawere analyzed by land cover.

Table 1: USDA NASS Land-use Strata forArkansas during 1991 and 1992.Stratum # Definition(1991--implemented in 1974)

11 over 80 % cultivated12 51 to 80 % cultivated20 15 to 50 % cultivated31 agri-urban: > 20 home/mile2

32 commercial: > 20 homelmile2

33 resort: > 20 home/mile2

40 less than 15 % cultivated50 non-agricultural

(1992--implemented in 1992)11 over 75 % cultivated 19521 25 to 75 % cultivated 4031 agri-urban: > 100 home/mile2 1032 commercial: > 100 homelmile2 542 less than 25 % cultivated 14040 non-agricultural 5

Analysis of sample segment classification consisted ofthree parts. First classified segment pixels weretabulated by the reflectance categories in S(edited).Next

satellite flies North to South over Arkansas in threepartially overlapping passes which cover, or view, theeastern, central and eastern regions of the state ondifferent dates. Landsat 5 repeats any given passevery 16 days with neighboring passes either seven oreleven days apart. At best, the central and easternpasses may be seven days apart. In some cases badweather requires dividing a single path (pass) into twoanalysis regions that differ by 16, 32 or more days.See Table 2 for TM scene overpass dates.

Analysts used S(edited)as input into the discriminatefunction categorizing Landsat TM pixels into separatereflectance categories. There were two phases ofmaximum likelihood classification. First the segmentpixel data were classified. Then after analysis andrefmement of segment classification, whole TM sceneswere classified.

SATELLITE DATA ANALYSISSeparately, for each land cover within each analysisregion, the segment Landsat data were studied foroutlier pixels and then clustered using a modifiedISODATA algorithm (Bellow and Ozga, 1991).Outlier pixels were identified using principalcomponent analysis and removed from the data beforeclustering. The result of clustering each land cover, ~,was several separable vectors, S~, of spectralreflectance each referred to as a signature. Thesignatures in S~were assumed to represent noticeablevariations in the land cover. For example, in Sliceseparate signatures were expected for unplanted fields,flooded fields, waste areas, fields in good or badcondition, and mixtures of rice and other covers.

When all land covers were clustered, the S~ wereassembled into one collection of signatures, S(all)' Theseparability of the land cover signatures in S(aIJ)wasanalyzed using Swain-Fu (Swain 1972) or transformeddivergence (Swain and Davis 1978) statistics. Somesignatures were separable. Most signatures had adegree of separability that would allow them to still beuseful for classification. The signatures with thepoorest separability were removed from S(all)' oraveraged with similar signatures, producing an editedcollection of signatures, S(edited).Each vector in S(edited)was still tagged with its original land cover but wasalso considered a separate category of surfacereflectance.

N

11,6732,7181,308

41818,561

35

11,7235,697

11,6735,0191,371

53210,658

889

n

144488428

44

844

8

Page 13: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

The regression estimator of total acreage for a landcover in an analysis region can be expressed as

Ha

y~(reg)== L N"h [Y~ah+ b~(X~ah - X~ah)]h~1

h = Ha+ 1,...,H for strata where the regression estimatoris not used. If the analysis region is not covered byTM dat~ h = Ha+l, ...,H.

L Ns = L Nh,=1 h-)

HH

H HL ns = L Dh .,-I h-I

A

ns = Dh == :l: nah anda=1

A

Let Ns = N h ,= L Nah ,a=1

cormmsslon and omission error based on the originalland use tags were examined using the kappa statistic(Congalton, 1991). Then segment classified pixelcounts were regressed with segment land cover totalsunivariately for each land cover. A separate first ordermodel was used in each applicable JAS land usestratum. If classification errors were acceptable andsimple linear regression analysis revealed no problemswith model assumptions nor outlier points, then thesegment classified pixel counts were used to calculatethe sample ancillary mean, and bl was used to estimatethe slope in the regression estimator. Otherwise, someof the satellite data analysis steps were repeated.

When sample level analysis was complete, analystsused S(editod)in classifying whole Landsat scenes. Aftera TM scene classification, the scene pixels weretabulated within JAS land use strata by category andland cover. These counts were use in calculating theancillary population means.

REGRESSION ESTIMATORRemote sensing researchers at NASS have usedancillary satellite information in a regression estimatorsince 1978. Analysts used the regression estimator inthis manner for land cover and crop estimationprojects with the National Aeronautics and SpaceAdministration and the National Oceanic andAtmospheric Administration (Allen and Hanuschak,1988). There is a theoretical downward bias of order1/n with this method (Cochran, 1977).

The NASS area frame stratifies each state by percentof cultivated land (Table 1.). Let s = 1,2, ...,H denotethese land use strata. In each stratum there are Nsprimary sampling units (PSU). NASS randomlyselects n, units (segments) from each stratum forenumeration during the JAS.

nah

. L(Y,ahJ-),ahf(l-R"ah 2)/(nah-2)il +(nah -3)"1]J~I

Where b<;ahis regression coefficient b] for land covery region ex and stratum h, and where

Nah

X~h == L XIah/Nah and X~i is the count of full,-I

scene pixels classified to land cover y in stratum hfrom the ith PSU in analysis region ex.

_ nah

Likewise, X\<lh = L x~/nah and x~. is the countj-I ~

of segment pixels classified to land cover cr in stratumh from the jth sample unit in analysis region ex.

~2 is the coefficient of determination between thereported acreage and classified pixel count of landcover y for stratum h in analysis region ex.

Now for the remaining analysis regions and stratawhere Landsat TM data were not used, a directexpansion estimator can be expressed as

After purchasing Landsat TM scenes covering thestudy area, NASS creates analysis regions for thediffering satellite overpass dates (Table 2.). Denotethe analysis regions ex = 1,2, ...,k.k+ 1•...,A where k ofthem are covered by Landsat data and A-k of them arenot.

Within each analysis region. there are Ha area frameland use strata where the regression estimator is used.If the region is covered by Landsat TM data (ex s k),o ::;~ ::;H. If the region is not covered by TM data(ex > k), then Ha = O. Denote the area frame land usestrata within a covered analysis region as h = 1,...,Hafor strata where the regression estimator is used and as

H nah

Y<;a(dir)== I: N ah/nah L Y~jh-Ha'i j=1

H "ahVarCY<;a(d"j)=, I: (Nah2-Nahnah)/(n2ah-nah)L (y~j --y<;ah)2

h~Ha+1 j-I

Where Y~hJ is the reported acreage of land cover yfrom segment j in stratum h from analysis region ex.The state level estimate of land cover cr using ancillaryLandsat TM data is written

A

Y 1M, == L Y (;a(reg)+ L Y<;a(dir)+ L Y <;a(di,)11=1 a'""l a-k+l

9

Page 14: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Table 5: Arkansas State Level Relative Efficiency(RE) for All Rice.

Crop CV 01:<%)Rice (1991) 10.1Rice (1992) 6.8

Table 6: Difference in Total Planted Acreage.Direct Expansion Estimate minus RegressionMethod Estimate Scaled by Standard Error.Crop (Y DE- Y TM)/SEDE (Y DE- Y ~/SETMRice (1991) 0.50 0.99Rice (1992) 1.32 2.36

Table 3: Kappa values (k), percent correct (ct) andpercent commission (cm) for sample segments'classification - All Rice.

____ Analysis Regions _Northeastl Southeastl Central 2

k ct cm k ct cm k ct cm717527 6768277479 19 81 84 14 83 87 14

for rice. Table 5 shows state level direct expansionCV's (CVDJ, Landsat regression CV's (CVnJ, andthe RE's for rice. Table 6 shows the difference oftotal planted rice acres estimated by direct expansiononly from the estimate produced through using theregression estimator scaled by standard error. Thestate level and analysis region acreage indications(unofficial estimates) cannot be shown due toconfidentiality restrictions.

RE3.93.2

CVTM(%)5.44.1

Table 4: Regression of Reported Segment Acreagewith Segment Categorized Pixels for All Rice.

Analysis RegionsStra- Northeast! Southeastl Central 2

turnS n R2 b n R2 b n R2 b19913

11 98 .94 .194 23 .96 22212 13 .99 .204 9 ---4

19923

11 54 .95 .195 53 .98 .203 37 .98 .19121 1

___ 410 .84 .174 7 .97 .190

Coverrice (1991)rice (1992)

In 1991, both state level direct expansion andregression method indications for planted acres of ricewere below the 1991 official NASS estimate, while for1992 the official estimate was between these two 1992indications. In 1991, YDEwas closer to the officialestimate, but in 1992 Y1M differed very little from theofficial NASS estimate. YTM,l99lwas 1.28 standarderrors (S~1991) below the 1991 official rice estimate,and YTM,1992was 0.53 standard errors (S~l992) abovethe 1992 official estimate.

k k A

Var(¥ ~)= :EVar(Y ~(reg~ +:EVar(Y ~a(dir~ +:EVar(Y ~dir~a-I a-I a-I

Table 4 shows the stratum level sample sizes (n"h) and~h 2 values for those strata where regression was used

Table 3 gives the kappa statistic, and percent correctand percent commission for rice in Arkansas for 1991and 1992. Commission errors were better in 1992with substantially better classification accuracy for1992 central region.

Before submission, the acreage indications areassessed through examining statistics from each of themain processing steps. Classification accuracy,exclusion error, and inclusion error are assessed usingthe kappa statistic, percent correct and percentcommission. The regression relationship of acres withclassified pixels is analyzed for fit, outlier segmentsand appropriate slope. Since the Landsat 1M pixel isapproximately 0201 acres, then bl should be near0201. Also, the relative efficiency (RE) of the statelevel Landsat regression estimator to that of the directexpansion (JAS) estimate is noted.

For both 1991 and 1992 the central and eastern areasof Arkansas were covered by 1M scenes. Weatherconditions in each year were the fmal determinate for1M scene selection. In 1991 acceptable 1M data wereobtained only for mid-summer over the centralanalysis region while early spring and mid-summerdata were available for the eastern region .Consequently, the 1991 central region was analyzedwith unitemporal 1M data while the eastern regionwas multitemporal. In 1992, spring as well assummer imagery was available, so that mu1titemporal1M data sets were created for all regression analysisregions. But the 1992 eastern region had differingsummer image dates for northeast and southeast andwas therefore divided into two analysis regions tocontrol for atmospheric and crop progress effects. Ingeneral, classification accuracy was higher in themultitemporal analysis regions than in the unitempora1regions.

RESULTSFor 1991 and 1992, the Remote Sensing Sectionsubmitted Landsat crop acreage indications to theNASS Agricultural Statistics Board and the ArkansasState Statistical Office early in December. NASS'sAnnual Crop Production Report, published in earlyJanuary, contained crop acreages from the Decemberboard.

10

Page 15: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

SUMMARYIn 1991 and 1992, the NASS Remote Sensing Sectionestimated planted rice acreage in Arkansas usingNASS June Agricultural Survey area frame data andancillary Landsat TM data in a regression estimator.To control for phenological effects, Arkansas wasdivided into analysis regions based on TM sceneoverpass dates. Each analysis region was analyzedseparately. A regression estimator was used within theintensively cultivated land use strata for the TMcovered analysis regions; otherwise, direct expansionwas used. The state level acreage estimate was thesum of the analysis region estimates. For 1991, theregression estimator produced a state level indication(unofficial estimate) which was 1.28 standard errorsbelow the NASS official planted acres estimate forrice. In 1992, the indication was 0.53 standard errorsabove the official estimate. For each year, theregression method indication and variance were lessthan the corresponding direct expansion indication andvariance.

I The northeast and southeast regions wereanalyzed as one region in 1991 and as twoin 1992.

2 The central region was analyzed unitemporallyin 1991.

3 The Arkansas area sampling frame wasreconstructed for 1992.

4 Direct expansion was used.5 Direct expansion was used in strata which are

not listed.DE Direct expansion method--no ancillary satellite

data used.1M Method using regression estimator with

satellite data where possible and directexpansion where not.

ACKNOWLEDGEMENTSI would like to thank Paul Cook of USDA NASS forhis 1991 analysis of central Arkansas.

REFERENCESAllen, J.D., 1990a, A Look at the Remote SensingApplications Program of the National AgriculturalStatistics Service, Journal of Official Statistics,6(4):393-409.

Allen, J.D., I990b, Remote Sensor Comparison forCrop Area Estimation Using Multitemporal Data, inProceedings of the IGARSS '90 Symposium, CollegePark, Md., pp. 609-612.

11

Allen, J.D. and Hanuschak, G.A., 1988, The RemoteSensing Applications PrOgram of the NationalAgricultural Statistics Service: 1980-1987, U.S.Department of Agriculture, NASS Staff Report No.SRB-88-08.

Bellow, M.E. and Graham, M.L., 1992, ImprovedCrop Area Estimation in the Mississippi Delta RegionUsing Landsat TM Data, in Proceedings of theASPRS/ACSM Convention. Washington, D.C.

Bellow, M.E. and Ozga, M., 1991, Evaluation ofClustering Techniques for Crop Area Estimation usingRemotely Sensed Data, in American StatisticalAssociation 1991 Proceedings of the Section onSurvey Research Methods, Atlanta, Ga., pp. 466-471.

Cochran, W.G., 1977, Sampling Techniques, JohnWiley and Sons, New York, NY, ch. 7, pp 189-203.

Cook, P.W.. ]982, Landsat Registration MethodologyUsed by U.S. Department of Agriculture's StatisticalReporting Service 1972-1982, USDAINASS/RemoteSensing Section.

Congalton, R.G., 1991, A Review of Assessing theAccuracy of Classifications of Remotely Sensed Data,Remote Sensing of the Environment, 37:35-46 (1991).

Cotter, J. and Nealon, J., 1987, Area Frame Design forAgricultural Surveys, U.S. Department of Agriculture,NASS Area Frame Section.

Gong, P. and Howarth, P.J., 1992, Frequency-BasedContextual Classification and Grey-Level VectorReduction for Land-Use Identification,Photogrammetric Engineering and Remote Sensing,Vol. 58, No.4, April 1992, pp 423-437.

Johnson, R.A. and Wichern, D.W., 1988, AppliedMultivariate Statistical Analysis, Prentice Hall,Englewood Cliffs, N.J., ch. 11, pp. 501-513.

Ozga, M., Mason, W.W. and Craig, M.E., 1992,PEDITOR - Current Status and Improvements, inProceedings of the ASPRS/ACSM Convention.Washington, D.C.

U.S. Department of Agriculture, 1992, CropProduction - 1991 Summary, Agricultural StatisticsBoard, NASS.

Page 16: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

COMPUTER ASSISTED PERSONAL INTERVIEWING (CAP!) COSTSVERSUS PAPER AND PENCIL COSTS

Bruce EklundNational Agricultural Statistics Service (NASS)

3251 Old Lee Highway, 3rd Floor, Fairfax, Virginia, 22030

KEY WORDS: Blaise, Computer Assisted PersonalInterviewing (CAPI), Interactive Editing (IE)

Computer assisted personal interviewing (CAPI)promises improved data quality and timeline~s, dataquality through edits invoked during data collecuon andtimeliness by telecommunication.

The purpose of this paper is to help an organization askthe right questions. First, is CAPI feasible? Theorganization must write easy-to-use instruments for itssurveys and integrate CAPI into its survey process.

A more comprehensive question is "should anorganization use CAPI?". Issues include feasibility,whether an organization can realize CAPl's potentIalbenefits, and the topic of this paper: cost. This paperbriefly reviews experience that shows NASS can realizeCAPI benefits, describes initial cost comparisons, anddocuments a parallel CAPI and paper comparison.

CAPI Cost Analyses are Specific to the Organization

Britain's Office of Population Censuses and Surveysthought that computer assisted interviewing, includingCAPI, would not only save money,but it was imperativeto meet new budget constraints (Manners, 1991). HowCAPI affects costs, however, depends greatly on anorganization's survey program. Cost issues include thedegree of centralization, number and frequency ofsurveys, consistency of surveys over time, length of datacollection periods, and the amount and complexity ofediting.

NASS's decentralized structure will challenge CAPImanagement. NASS trains enumerators from 44 stateoffices. In each state they gather each year for eachmajor survey. Training costs already consume aconsiderable portion of survey costs. Costs include stateoffice staff and enumerator salaries, lodging, and perdiem. If CAPI training increases total training time, thenCAPI increases NASS survey costs appreciably_

12

NASS collects survey data within short periods. Also,most surveys are quarterly or annual, not weekly ormonthly. Thus, although NASS needs to write severalsurvey instruments and train for several surveys, thereare short, concentrated periods to realize CAPI benefits.

Conversely, during the concentrated periods, NASSspends considerable staff time on office editing after theinterviews. Clerks verify arithmetic calculations; dataentry clerks key and verify; and professional, subjectmatter specialists (agricultural statisticians) hand editquestionnaires. Afterward, NASS rents mainfr~ecomputer time to batch edit the data. StatIstIcIansusually wait overnight for the output, make corrections,and then run another batch edit overnight. If CAPIreduces office editing time, it will greatly affect surveycost.

NASS Experience

NASS developed CAPI for a cross section of itssurveys. The more of the survey program NASS can useCAPI, the more economically viable CAPI is. NASSwould show that for a cross section of its surveys:

1) NASS can develop easy to use survey instruments

2) emunerators can learn and use CAPI well

3) telecommunication would speed data retrieval

4) CAPI would help clean data

NASS first used CAPI operationally for a simple survey,the Livestock Prices Received. CAPI data werecompared to paper collected data with a post-collectionbatch edit. CAPI flagged suspicious data (outSIdespecified ranges) during the CAPI interviews. Theenumerator could fix an error or verify valid data CAPIcollected data had 57% fewer suspicious "error" flags inthe post-collection batch edit (Eklund, 1991).

More importantly, CAPI reduced "critical" errors by afactor of 16.5. Critical errors are non-sensible entries.

Page 17: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

For this survey, batch edit critical errors often resultedfrom missing items that made data records unusable.CAPI required enumerators to answer these itemsbefore proceeding.

Post-Collection Batch Edit "Error" Rates!

Livestock Prices Received Survey

For Pa,per& Pencil

non-critical, suspicious "error"%= 3.74%

critical error%= 0.33%

n - llOOO

For CAPI

non-critical, suspicious "error"%= 2.15%

critical error% = 0.02%

n- 12000

1) Pct. of records (data from I1vestock sales) wit'errors".

In 1989, for the September Agricultural Survey, NASSfound what other organizations have found: enumeratorscan use CAPI effectively and respondents accept CAP!.Two survey software products, CASES and Blaise weretested for CAPI.

NASS also used laptops and electronic calipers to sizealmonds to forecast production during the growingseason. Enumerators entered data both from the keyboard and the caliper. The caliper, connected to a serialport, put data directly into the survey instrument.Enumerators then sent the interactively edited data tothe State office by telecommulllcation, showing theability to reduce the time between field assessment andcrop forecasting. Furthermore, the technology caneliminate the need for a two-person team. With CAPI,one person can simultaneously measure and record.

Next NASS tested CAPI with a challenging survey, theFarm Cost and Returns Survey (FCRS). The paperquestionnaire is over thirty pages, has complexbranching, requires plenty of arithmetic, asks detailed

13

financial questions, and averages 90 minutes perinterview. It IS difficult to convey the specific intent forsome questions. Often farmers cannot respond precisely.NASS proved that Blaise CAPI software could handlethis complex questionnaire. Enumerators showed thatthey could learn and use CAPI for one of NASS's mostdifficult surveys.

Relationship between CAPI & Interactive Editing(IE)

An important part of cost comparison is the relationshipbetween CAPI, Computer Assisted TelephoneInterviewing (CATI), and after data collection,Interacti ve Editing (IE). The Netherlands' CentralBureau of Statistics developed integrated surveysoftware called Blaise. The user can write code and thencompile tn Pascal into either a CAPI, CAT!, or an IEsurvey lDstrument. After CAPI or CAT!, a subjectmatter specialist can use an IE instrument to review thedata interactively, questionnaire by questionnaire, withautomated edit checks. Programmers have only to writeone set of code for CAPI, CAT!, and IE. Cost estimatesheremafter assume NASS effectively integrates CAPIand IE software fimctions.

What Does CAPI Cost?

Understandably, NASS did not design its survey costaccounting WIthcategories conducive to contrast paperand CAPI costs. Thus ten state offices estimated costsfor several sub-categories of survey preparation,enumerator training, data editing, data entry, batchediting, and mailing costs.

A framework, starting with the FCRS project, was builtfor estimating and comparing CAPI cost. Spreadsheetscan ease updates of cost estimates as costs change andpeople learn to use CAPI more effectively.

Corresponding costs were estimated or projected forCAPI based on previous CAPI and IE applications.Although survey specific cost estimates were developedfor the FCRS, hardware costs were amortized bothacross surveys within a year and across years for

Page 18: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

hardware life expectancies. Thus, one can extrapolate

CAP/COSTSvs.

PAPER COSTS

SURVEY COSTS PAPER CAP]

Enumerator Training XHardware XSurvey Preparation xMail Costs xTelephone Costs xData Entry xData Editing X

the cost estimates for the gamut of CAPI targetedsurveys. The chart depicts major (large X) and minor(small x) cost increases. An assumption is that peopleuse CAPI/IE effectively, which excludes some initialcosts.

Not only hardware but also training cost is likely toincrease significantly with CAP!. Effective and costefficient CAPI training is challenging. The organizationmust balance or integrate effective methods such as lowstudent/teacher ratios and field observation with costefficient methods such as home self training andpractice. Also, survey trainers should build CAPIpractice into the survey schools.

Most savings will come from office editing, which isexamined in following sections. CAPI can save somesurvey preparation cost. Automated distribution ofsamples is cheaper than the present method of writing aprogram to print labels in each state office and thenmanually distributing properly labeled questionnaires toeach enumerator. CAPI also saves mail and data entrycosts but increases telephone costs somewhat withtelecommunication.

Total CAPI cost was estimated close to or slightly lessthan paper collection in 1992 (Eklund, 1993). Themargin of error of some estimates was such that thepaper method mayor may not have still been slightlyless expensive.

The important points are to have reasonable estimatesand to understand cost trends. Paper intensive costs suchas mail and office labor are increasing. CAPI costs, suchas hardware and telecommunication are decreasing(Clayton and Harrell, 1989). An organization shoulddevelop, test, and gain CAPI experience on a small scale

14

to prepare for large scale use when cost, data quality, andtimeliness, point to CAPI use.

CAPI and the June Area Frame Survey

Although this survey is considerably shorter than theFCRS, enumerators are much more likely to interviewfarmers outside. They often stand to interview as farmerswork. Enumerators also must carry a 2' by 2' aerialphotograph of the farm. Smaller computers enabledenumerators to use CAPI successfully in June, 1993.

The Blaise CAPI instrument was also coded forstatisticians to use as an IE instrument after datacollection. The IE instrument invoked more rigorous editsthen CAP!. For example, the Blaise software requires oneto "fix" edits deemed critical, but one can override non-critical errors. Programmers coded some critical IE editchecks as non-critical CAPI edit checks. Like the FCRS,enumerators handled this difficult survey well.

CAPI!IE Costs and the June Area Frame Survey

Once enumerators sent data to the State office, the editingflow was divided into four processes.

Process 1:Traditional method

1) paper interview => clerical edit => 1st statistician edit=> 2nd statistician edit => key entry => key entryverification => batch edit.

Process 2: IE with one statistician hand edit

2) paper interview => clerical edit => 1st statistician edit=> key entry => key entry verification => IE => batch edit.

Process 3: IE with no statistician hand edits

3) paper interview => clerical edit => key entry => keyentry verification=>IE => batch edit.

Process 4: CAPI

4) CAPI => IE => batch edit.

The office staff divided into three teams that rotated to editunder each of the first three processes. The officeprocessing time per interview is in the following table.The CAPI/IE combination shows the potential to reducethe officeprocess by a factor on more than three.

Page 19: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Survey Process Times in State Office

(Minutes per "Interview")

Process 1

Clerical Edit 2.5

1st Statistician Edit 5.2

2nd Statistician Edit 3.3

Key and Verify 5.1

Interactive Edit --

Total Time 16.1

n - Interviews 757

2.4

3.9

5.0

14.7

341

3.8

5.0

13.5

391

<5

86

From observatIOn, CAPI slowed the interview within asection that comprises a large table. For other parts ofthe questionnaire, particularly parts with automatedskips, CAPI was faster. CAPI also saved time because itautomated calculations.

Enhanced software or faster computers will soon furtherspeed CAPI. Enumerators also suggested specificfeatures that programmers could use to speed CAP!.Relati ve CAPI and paper interview times depend uponthe survey, software, hardware, the intensity ofinteractive edits, and the imagination of surveyinstrument designers.

Conclusions

Although CAPI has high initial costs for hardware andtraining, it s,hows potential in reducing office workduring a peak work period. CAPI can greatly reduceoffice keying and hand editing. CAPI/IE showssubstantial gains over both the paper method and overIE without CAP!.

Statisticians edited using the IE mstrument more slowlythan they will in the future because of: 1) a learningcurve and 2) imperfections m the operationally untestedinstrument.

The IE part of process 4 was not in a production modeas processes 2 and 3 were. CAPI data were compareditem by item to the paper backup to scrutinizeenumerators' CAP!. Thus, the IE part of processes 2 and3 were conservatively extrapolated to IE for process 4.

The project did not measure some survey managementcosts such as post-collection file handling. Conversely,CAPIIIE should reduce waiting periods for batch edits,reduce time editing those batch edits, and reduce out ofpocket costs from the batch mainframe edits. Also, moresavings should come as programers, statisticians,enumerators, managers, the entlfe organization; gainsCAPI experience.

Interview Time

Differences in interview times are a negligible part ofsurvey costs. Interviewing takes a small portion of theenumerators' time. Enumerators spend much more timetraveling and finding sampled farmers. Interview lengthis a respondent burden issue, not a cost issue.

The instrument measured CAPI interview lengthsautomatically for the entire mterview and by sectionswithin the questionnaire. Enumerators do not recordpaper interview times so no comparison was made.

15

Results from the 1993 CAPIIIE June Area Framesupported key parts of previous projections of CAPIreductions 10 office work load. The unpolishedinstrument was facing its first operational test. FutureCAPIIIE development should yield even more savings.

This promising work is fledgling cost analysis, not thefinal word Cost estimates should be revaluated on anongoing basiS as the organization refines and expandsthe CAPIIIE process. The organization should use thesecost analyses as a tool to help plan CAPI use.

This work focused on a particular aspect of the CAPIdecision: cost. It did not quantitatively includetimeliness and data quality. The organization mustdecide how to weigh all considerations before choosingwhether to use CAPI.

Acknowledgments

Mark Pierz.chala and Ann Romeo were co-conspiratorswith the author in coding the June Area Frame surveyinstrument.

Page 20: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

References

Clayton, R. and Harrell L., Bureau of Labor Statistics,"Developing a Cost Model for Alterative DataCollection Methods: Mail, CAT!, and TDE". Militaryand Government Speech TechnolollY Conference.November 1989, Arlington, VIrginia.

Eklund, B. Mobile Computerized Data Entry and Editfor the Livestock Prices Received Survey. page 5, SRBResearch Report No. SRB-91-l0, Washington D.C.,USDA, National Agricultural Statistics Service,November, 1991.

Eklund, B. Computer Assisted Personal Interviews:Experience and Strate~y.page 11,STB Research ReportNo. STB-93-01, Washington D.C., USDA, NationalAgricultural Statistics Service, February, 1993.

Manners, T. "The Development and Implementation ofComputer Assisted Interviewing for the British LabourForce Survey". Computer Assisted Survey InformationCollection· International Pro2ress. Special ContributedPapers for the American Statistical Association 1991Joint Statistical Meetin~s. August 1991, Atlanta,Georgia

16

Page 21: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

History and Procedures of Objective Yield Surveys in the United States

Eric Waldhaus, Eddie Oaks, Mike Steiner, National Agricultural Statistics ServiceEric Waldhaus, NASS, USDA, Room 4162, South Building, Washington, D.C. 20250

KEY WORDS: Crop cutting, Enumerators, ObjectiveYield, Yield Surveys

This paper reviews the efforts made by the NationalAgricultural Statistics Service (NASS) in the UnitedStates Department of Agricultural (USDA) to measurecrop production by direct measurement of plantcharacteristics. NASS implies the current agency andall of its predecessors. These efforts are collectivelyknown as Objective Yield Surveys (OYS). Thisprogram is similar to Crop Cutting surveys done inmany parts of the world. The major difference is thatOYS includes non-destructive field counts prior toharvest to facilitate yield forecasts during the growingseason. Yield is defined as the weight of targeted crop,at standard moisture, produced per unit of harvestedarea. Sampling for NASS Objective Yield Surveyshave been on a probability basis from the inception.This is, however, not the implication of the word'objective' in the title. Objective refers to the directcollection of plant characteristic measurements, insteadof subjective estimates of yield reported by an observer.

NASS is a recognized world leader in the use ofobjective yield technology. Objective yield surveysproduce the primary indications for yield forecasts andestimates for the major feed and food grains in theUnited States. Additionally, NASS has made long termcommitments to make this technology availableinternationally. Through cooperative arrangementsNASS has demonstrated or helped implement objectiveyield programs in many countries of Asia, Africa, andCentral and South America.

Three aspects of NASS' objective yield program formajor field corps are considered: The history andevolution of the program, current sampling procedures,and general concepts of objective yield survey fieldprocedures. Specifically not included is a discussion ofthe use of survey data in preparing yield forecasts.This major topic is covered in another paper presentedat this conference.

HISTORY

Yield and production of major field crops in the UnitedStates have been forecasted and estimated by USDAsince President Abraham Lincoln's administration in the1860's. Crop condition surveys were prepared monthly

by the Statistics Division, USDA as early as 1863, theyear following the creation of the Department. Until1884 pre-harvest reports were in terms of condition ascompared to an 'average' crop. In 1884 the reportingconcept changed. Condition began being asked as apercent of a 'normal' crop, given no adverse effects ofweather, disease or pests.

Although crop area changes from year to year, some ofthe largest variations in crop production are caused byfluctuations in production per unit area or yield. Formore than a century, yield forecasts were based solelyon voluntary producer appraisals of expected yield. Itwas recognized early that actual changes in yield werenot fully reflected in subjective grower appraisals. By1898 travelmg agents supplemented farmer-cropreporters' information with on site observations of cropconditions. By 1903 more than 100,000 agriculturerelated business operators, including cotton ginners,millers, elevator operators, and transportation agentswere paneled to gain insight into the agriculturalsituation.

In 1910 a shift began in the practice of reporting cropcondition to forecasting actual production during thegrowing season . By 1915 cotton production forecastsbecame available during the growing season. Thetransition from condition to yield forecasts requiredregression modeling. This was almost entirely done byvisual interpretation of charts prior to the use ofcomputers in the late 1960's.

Objective measurements for forecasting yield startedwith colton in 1928. These early efforts involvedstatisticians driving along the perimeter of cotton fields,making boll counts at predetermined locations in fields.There appears to have been no effort made to relate thefield counts to yield. Thus, it may be more appropriateto think of this early effort as 'Objective Condition'surveys. Later corn and wheat were added to thisprogram, but this early effort in objective methods wasdiscontinued at the start of the World War II. Researchinto objective measurements of wheat, corn, and cottonresumed in 1954.

The 'birth' of probability sampling for agriculturalstatistics and objective yield methods came in 1957when the United States Congress funded an initiativetitled "A Program for the Development of the

17

Page 22: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Agricultural Estimating Service". The project providedfor an annual enumeration of a large area frameprobability sample for crop area estimates. This areaframe survey evolved into the current June AgriculturalSurvey (JAS). Target crop fields identified during theJAS provide the sampling universe for the OYS (exceptwinter wheat).

Cotton and corn objective yield programs becameoperational in 1961. Wheat came on line a year later.Soybeans joined the national program in 1967 andpotatoes in the early 1970's. Grain sorghum,sunflowers and rice were added in the 1980's, but dueto budget constraints grain sorghum and sunflowerswere dropped in 1988. The rice program was reducedthen and finally discontinued in 1993.

OBJECTIVE YIELD SURVEYS OVERVIEW

NASS is organized into 45 State Statistical Offices(SSO). There is one in each state except in NewEngland where six states are combined. There is acentralized Headquarters in Washington, D.C. Sampledesign and selection, planning and coordination betweenstates, centralized data processing, and qualityassurance are the major roles of Headquarters in theOYS. Headquarters prepares and distributes twomajor OYS manuals. The Supervising and Editingmanual is focused on the tasks completed in the SSO.The Interviewers' Manual is a training and referencemanual for enumerators in the field. Each SSOcoordinates field work and other data collectionactivities independently within established guidelines.

Qualified, adequately trained field personnel, includingSSO staff and field enumerators, are essential for aquality job. States send a survey statistician, designatedthe State Survey Statistician, to a National trainingworkshop to learn and reinforce correct procedures.State Survey Statisticians return to their states to trainfield supervisors and enumerators.

Objective Yield Surveys begins with intensive trainingfor field enumerators. Training is more intensive forOYS than many other NASS data collection operations.The need for rigorous training stems from the fact thatdata collection usually is accomplished in remotelocations in the field where supervision is minimal andthere is not the opportunity to clarify procedures. It isalso recognized that data collection is often a very timesensitive process so it may be impossible to reconstructan 'interview' when errors are discovered after the fieldwork is complete. The cost of training is very high,but the need is critical. NASS has consistently

18

recognized this need and continues to make a substantialresource commitment to training.

NASS field enumerators are part time employees of theNational Association of State Departments ofAgriculture (NASDA). NASDA contracts their servicesto NASS. There are approximately 600 NASDAenumerators who work on OYS. OYS enumerators arealmost exclusively rural people, and most are fromfarm families, typically retired or part-time farmers orfarm spouses. Understanding agricultural practices isa prerequisite for a successful OYS enumerator.Enumerators also have to demonstrate literacy andcomputational skills about equivalent to a high schoolgraduate.

In addition to training, the State Survey Statistician ischarged with monitoring survey progress, and is theresource person for enumerators. Responsibly alsoextends to oversight of all SSO processing of surveydata, and supervising laboratory processes. The StateSurvey Statistician and assistants review all edit andsununary output. In most states the final yieldrecommendations (proposed estimates) submitted toHeadquarters are not prepared by the SurveyStatistician, but a Commodity Specialist.

Field Quality Control is conducted by supervisoryenumerators and statisticians from the State office. Arandom sample of each enumerator's field work isselected for personal inspection. The sample selectedfor quality control is unknown to the enumerator andthe supervisor in advance to insure an accurateassessment of quality. The sample pattern is such thatat least one quality check for each enumerator isinsured, and multiple checks throughout the surveycycle are possible. Supervisors may inspect additionalwork of the enumerators in their charge on an 'asneeded' basis.

Occasionally, deficiencies in field procedures arediscovered by the quality control process. When thisoccurs remedial action is taken, both to correct errorsin a particular sample and to re-train the errantpersonnel. Discovery of deliberately falsified surveyresults is another potential benefit of the Qualityprogram. The authors, with about twenty years ofobjective yield experience each, have no personalknowledge of this ever occurring.

Objective yield surveys are timed for making cropproduction estimates which are released to the public inthe monthly Crop Production report. Crop Production

Page 23: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

is published during the second week of the month,between the 8th and the 12th. To complete field work,process all data, and remain timely, the OYS adheres toa very rigid schedule. Data collection starts on the22nd of the month prior to the survey reference date,and must be completed by the first of the referencemonth. Laboratory work, data processing, andsummary review are completed, and recommendationsubmitted to the NASS Agricultural Statistics Board inHeadquarters by the second day before the CropProduction release.

Concepts and methodology used in the OYS forforecasting and estimating yields are similar for all fieldcrops. Two components of yield -- weight of the fruitand number of fruit -- are used to forecast a yield.Yarious plant characteristics are used to predict thesecomponents during the growing season. Harvest losses,estimated by gleaning small plots in the sample fieldsafter harvest, are deducted to obtain a net yield.

During the early growing season, crop maturity variesconsiderably by region. Plant characteristics andmeasurements made to forecast yield change as theseason and plant maturity progresses. The enumeratordetermines the maturity stage of the crop in the samplefield during each visit and makes the appropriate countsand measurements for the growth stage.

Observations for each sample are made on tworandomly selected plots (units) in each of the selectedfields. Each plot consists of a specified number ofparallel rows of predetermined length, or a rectangularunit drawn to specification if crop rows areindistinguishable.

SAMPLING

OYS samples are selected from acreage reported of thetarget crop in the March Agricultural Survey (MAS) orthe June Agricultural Surveys (JAS). Spring and durumwheat, corn, cotton, potatoes, and soybean samples areselected from the JAS. The winter wheat sample comesfrom the MAS.

Winter wheat samples are unique as they are selectedfrom the March Agricultural Survey using a multipleframe (combined list and area survey) design. Also,winter wheat varies in that samples are drawn from'fields to be harvested for grain', while other crops aresampled from fields 'planted and to be planted' on theparent survey.

19

The objective yield sample for each crop is allocated tothe most important production states such that 80percent or more of the nations crop is included.Allocations are made to minimize production estimatecoefficient of variation (CY). Until about 1990allocations were made to maintain minimum harvestlevel CY's. As estimation models have improved, aneffort has been made to allocate samples to maintain aminimum CV across the growing season.

The JAS, which is the parent survey for OYS, is themajor, once a, year, multiple frame survey conducted byNASS. Nationally, the area frame component includesapproximately 15,500 segments, each about 1 milesquare, representing about 52,500 farms which areenumerated in early June to identify land use. The areaof target crop planted is expanded by the associatedexpansion factor for the area frame sample. OYSsamples are then selected proportional to the expandedacreage. Proportional sampling insures that thedistribution of the OYS sample will approximate thedistribution of the crop as discovered in the JAS.Sampling procedures are similar for winter wheatexcept MAS is the base survey.

Survey State,;, sample size, and sample distribution arereviewed annual, but NASS has attempted to maintainconsistent State involvement and sample sizes tomaintain year to year comparability. In 1993 1,670winter wheat samples were selected in 13 States.Spring wheat samples totaled 380 in four States, and150 durum samples in one State were selected. Comsamples equaled 2,010 spread over 10 States, while1,360 Cotton samples were drawn in six States.Soybeans samples totaled 1,330 in eight States, and2,080 Potatoes samples were distributed over 11 States.

FJELD PROCEDURES

Enumerators are provided aerial photograph with thearea frame segment containing the selected sample fieldoutlined in red. Operators of land in these segmentswere interviewed during the JAS. Within the segmentthere may be more than one tract (farm). Theenumerator locates and interviews the operator of thetract which contains the selected target crop field forOYS,

Six reporting forms are used through the growingseason to collect information from the farm operator orto record counts and measurements. The reportingforms are identified by letter initials, which reflect thechronological order of use of the forms during thegrowing sea,;on. The data collected on each form are

Page 24: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

similar for all crops in the OYS program.

A convenient way to describe the field procedure forimplementing the OYS is to describe each reportingform, and explain its use.

Fonn A - is an interview form, used to update thecrop acreage intended for harvest and to identify thesample field. It shows which field (area frame) or howto select a field (list frame) that will be used for makingactual field counts and measurements. The Form A iscompleted on the first visit to the selected farm. It isalso used to gain permission from the fanner to enterthe field to set out OYS sample units, and to query thefarmer about pesticide usage so the enumerator can takeappropriate personal safety precautions.

Pesticide usage has expanded over the years both in thecrops treated and the variety of chemicals available.Consequently pesticide safety training and enumeratorexposure monitoring has become an integral part of theOYS program. This is especially true for Cotton OY,where the use of organophosphorus pesticides is nearlyuniversal.

Fonn H - also an interview fonn, is used to collectdata on seed, fertilizer, and pesticide application ratesand tillage practices. These data are used for furthereconomic analysis, and are not part of the yieldestimation program directly. It is completed at thesame time as the Form A.

Fonn B - is a field observation recording form. It isused to record counts and measurements of the plantsand fruits. This form also reiterates instructions forlocating, constructing, and processing the sample units.

The following two sections: Locating the Sample, andCounts and Measurements are presented here becausethese activities are associated with completion of FormB. A separate Form B is completed each survey monthuntil harvest time, when a final Form B is completed.

LOCATING THE UNIT:

After completing Forms A and H, the units areconstructed in the sample field by the enumerator. Twounits are laid out for each sample. Unit 1 and Unit 2are located independently of each other (except in wheatwhere unit locations are dependent). The randomnumber of rows and paces for locating Units 1 and 2are computer generated and preprinted on a label on theForm B.

The point of entry into the field, or starting comer, isthe first comer reached when approaching the field thatallows the units to have a chance of falling anywherewithin the field boundaries. The shape of the field mustbe considered to insure that the entire field has a chanceof selection. Research has indicated that there is nostatistical differences related to starting comers.Therefore, any field corner which does not excludesome part of the field is acceptable.

The following steps are followed when locating andlaying out units:

Step 1: The enumerator marks the staring comer witha piece of plastic flagging ribbon so it will be clearlyvisible on later visits.

Step 2: The enumerator then v;alks along the end ofthe crop rows the number of rows (or paces for wheatand broadcast seeded fields) indicated for Unit 1. Apiece of flagging ribbon is tied onto the first plant inRow 1. This helps locate the same row on later visits.The next row in the direction of travel will be Row 2 ofUnit 1.

NOTE: The enumerator walks his or her nonnal paceswhen locating the units within the field. It is notnecessary to measure the distance traveled as it is notnecessary to locate a precise point in the field, only onedetermined by a random process.

Step 3: The enumerator then walks the requirednumber of paces into the field between Row 1 and Row2, starting the first pace 1.5 feet outside the plowed endof Row 1. This makes it possible for a unit to fallanywhere in the field including the very edge.

Step 4: After the last of the required paces is taken, adowel stick is laid down so that it touches the end ofthe enumerator's shoe. The dowel is placed across Row1 and Row 2, at a right angle to the rows. The unit islaid out in the direction of travel of the last pace.

Step 5: The zero end of a 50 ft. tape is anchored atthe dowel stick directly beside the plants in Row 1.The sample number is written on a florist stake andinserted at the anchor point.

Florist stakes are colored lath about 6 to 8 inches long.They are highly visible markers commonly used innursery and greenhouse operations to mark seed beds.Florist stakes deteriorate quickly so no hazard will becreated if lost or abandoned in the field after the

20

Page 25: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

survey.

Step 6: In row I a starting florist stake is placedexactly 5 feet from the anchor point. It is marked"UI-RI". This measured 'buffer zone', helps insurethat the unit location is not subjectively biased in itslocation by the enumerator. The florist stake shouldbe placed beside the row about 2 inches from the baseof the plants. The marker is placed outside the plantrow to avoid any damage to the developing crop.

Step 7: Working outside the unit, the enumeratorcarefully measures the unit length and places a floriststake at the designated point. Com, cotton and potatoeshave larger unit lengths which are measured with atape. For example, the corn count area is 15 feet long.A rigid metal frame is used for marking wheat andsoybeans where the unit size is smaller. The wheat unitis 21. 6 inches.

Not all fields are square or rectangle and other specialsituations may arise when locating and laying out aunit. The Interviewers' Manual gives details on how tohandle most of these situations. Some of the problemsthat more commonly occur include: blank areas in thefield that were known or unknown during the mid-yearsurvey; the field is not large enough to accommodatethe number of rows or paces specified; row directionchanges; odd shaped fields are encountered as circularfields under pivot irrigation; fields planted on contours;or crop rows that are not distinguishable due to sowingpractices. These situations are covered with preciseinstructions.

The Form B is the recording form for counts andmeasurements that are made at the units. Visits tothese sample units will take place monthly during thegrowing season except for potatoes, when only one visitis made within 3 days of harvest or when vines aredead.

Because the same sample unit must be revisited monthlyit is important the enumerator precisely mark thelocation of the unit. Plastic flagging ribbon is used.This is highly visible, but like the florist stakes, quicklydisintegrates so it may be abandoned after the survey.

COUNTS AND MEASUREMENTS

Step 1: Measure I-row space and then 4-row spaces.Measurements are made from the plants in row I torow 2 and then from row 1 to row 5. These

21

measurements are used to calculate area of the unit.

Step 2: Count the number of plants in each row in thedesignated unit.

Step 3: Classify the unit by maturity category.Descriptiw four page handouts with color pictureexamples are helpful in determining maturity.

Step 4: Make the specific counts and measurements ofplant characteristics required. Different counts aremade depending on the maturity level category. Thecrop and type of counts are as follows:

Soybeans: I) plants; 2) nodes; 3) lateral branches withblooms, dried flowers, or pods; 4) blooms, driedflowers and pods; and 5) pods with beans.

Corn: I) plants; 2) average length of kernel rows; 3)diameter of ear; 4) stalks with ears or silked ear shoots;5) number of ears; 6) ears with kernel formation; and7) cob length: and 8) field weight of corn.

Cotton: I) plants; 2) burrs, open and partially openedbolls; 3) large unopened bolls; 4) small bolls andblooms; and 5) squares.

Wheat: 1) stalks; 2) heads in late boot; 3) emergedheads on all stalks; and 4) detached heads.

Potatoes: 1:, hills; 2) tubers; and 3) field weight oftubers in the unit.

After completing Unit 1 counts and measurements goback to the beginning of the Row 1, and walk to thedesignated row, or number of paces, for Unit 2.Continue in the original direction of travel as whenlocating Unit I if Unit 2 count exceeds the Unit Icount. After locating the Row 1 of Unit 2, walk therequired paces into the field to set up Unit 2, and makethe counts and measurements required.

A Form B :,s done for each month until very nearharvest. Close contact is made with the operator so asample field will not be harvested before a final FormB (just before harvest) can be completed. During thislast visit before the farmer harvests, a sample of maturecrop is sent to the laboratory. This sample is the basisfor at harvest yield estimates.

FORM C-l and C-2 - These forms record laboratoryobservations, and are not seen by the field enumerator.Form C-l ft~cords data from pre-harvest field visits,

Page 26: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

while the C-2 is generated from the last field visit madeat, or just before the farmer harvest.

FORM D - is used to record the actual number of acresharvested at the end of the year and the operatorestimated yield of the field.

FORM E - is a field observation form used to collectdata for determining field harvest loss so a net yieldestimate can be made. The field visit to collect datamust be within 3 days after harvest to determine harvestloss accurately as loose grain deteriorates quickly or islost when left in the open. Harvest losses aresubtracted from gross yield to arrive at a net yield.Finding the location of this post-harvest unit is similarto the original unit location. A measured rectangle isstaked out and fruit from the crop is collected, and sentto the lab. There it is counted, weighed, and moisturetested to determine the field loss.

NON SAMPLING ERROR

Controlling non-sampling error is a major concern ofthe OYS program as in any large scale sampling surveyproject. Cause for OYS non-sampling error can bedivided into two major categories: faulty procedures,and faulty procedure implementation. Additionally, asOYS use sub-samples from other surveys, non-samplingerror present in the parent survey is passed on ormagnified. This is out of the control of the OYSpersonnel except to monitor the larger survey forconsistency. This source of error will not beconsidered further herein.

Non-sampling error which are the result of faultyprocedures can be dealt with in a straight forwardmanner. The NASS research unit continuously reviewsvarious aspects of the OYS program to insure surveyvalidity. Validation surveys are conducted for eachcrop on a rotational basis. These surveys explore manyaspects of OYS, such as the independence of thestarting comer as noted earlier.

The survey quality program is also useful in discoveringfaulty procedures. Most often procedural difficultiesthat are discovered in the quality control program relateto some 'special case' which was not adequatelyconsidered when preparing manuals. Instructionchanges that clarified selecting starting corners that donot exclude some part of the field developed largelythrough this route.

Insuring that procedures are consistently and accurately

22

followed across the country is the greatest challenge incontrolling non-sampling error. The most importantcontrol for non-sampling error is training. OYStraining is continuous. The training cycle starts withtraining for State Survey Statisticians at Nationalworkshops. Usually there are three held in a year, onefor Wheat, another for Com, Cotton and Soybeans, andthe third for Potatoes. Com, Cotton, and Soybeans arecombined for training because procedures, growingseasons, and States involved largely overlap.

Training continues with workshops for fieldenumerators, conducted by the State Survey Statistician.Assistance from the Headquarters OYS unit is availableto the SSO's in conducting local training. This can bean important resource for a new State SurveyStatistician, and gives Headquarter personnel theopportunity to observe local operations.

The formal quality control program in which individualenumerators have work inspected at random is animportant part of the NASS non sampling error controlprogram. While the potential is in place to discover anenumerator who is intentionally falsifying reports or'table topping', this is not a major concern. The realvalue of the quality control program is to assess thelevel and effectiveness of training. Another importantbenefit of the program is its moral boosting effect onenumerators. The normal out come of quality controlis that the enumerator is 'caught doing it right'. Whenfed back to the enumerator in a positive way this can beexcellent reinforcement for continued quality fieldwork.

Page 27: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

YIELD MODElS FOR CORN AND SOYBEANS BASED ON SURVEY DATA

Thomas R. Birkett, National Agricultural Statistics Service, USDAUSDAINASS, Rm.4813-South, Washington, D.C. 20250

Abstract

The National Agricultural Statistics Service uses surveydata to forecast yields for major agriculturalcommodities, including com and soybeans. The surveydata contains variables that become the independentvariables in linear forecasting models. This paperdescribes the forecasting models, showing what the keysurvey variables are and examining how they are relatedto final yield.

Introduction

The National Agricultural Statistics Service (NASS), anagency of the United States Department of Agriculture,conducts monthly field surveys in the late summer andfall to forecast com and soybean yields. Summarizeddata from the survey. forms the independent variablesfor a statistical model that predicts the current seasonfinal average yield. The survey data include variablescorrelated with the final average number of ears or podsthat will be harvested, along with variables correlatedwith the final average grain weight per ear or weightper pod. This paper gives a short description of thesevariables and how they are used to forecast finalaverage yield.

Description of the Objective Yield Surveys

In June, NASS conducts a very large survey ofagricultural1and use in the U.S. to estimate the currentseason's acreage planted to com and soybeans. Fromthe base generated by this survey, NASS draws arandom sample of com and soybean plots. This is donethrough a two stage process, in which fields areselected and then random locations are designatedwithin each selected field. The procedure is carried outso that a simple random sample is obtained, and eachplanted acre of com or soybeans has an equal chance ofbeing included in the sample. This simple random

sample propeny is an important assumption for thestatistical models to be applied to the survey data.

The randomly located plots are a few square feet inarea. Within the plots, enumerators count and measurevariables that are positively correlated with final yield.Among the variables collected for soybeans are numberof plants per acre, number of nodes per plant, numberof lateral branches per plant, number of blooms, driedflowers and pods per plant, and number of pods withbeans per plant. For com the NASS enumerators countthe number of stalks per acre, number of stalks withears, number of ear shoots, and number of ears withkernels per acre. They also husk a random sample ofears near the plot and measure the length of a typicalkernel row on each ear. Just prior to farmer harvest ofthe com or soybean field in which the sample islocated, the enumerator harvests the plot and obtains thefinal yield. The same sample plots are revisited eachmonth starting in August until farmer harvest.

Samples are laid out in all the major com and soybeanproducing states. Data are collected during the periodfrom the 21st of the previous month until the first of themonth. Starting in August and continuing throughNovember, around the 10th of each month the USDAreleases yield estimates for each state based on thesurvey.

Variables in the Regional Models

The best relationship between the survey data and finalyield is found at the regional level, the region being theset of states in the survey. Consequently, the plot leveldata is summarized to the state and then to the regionlevel, where it is modeled against the region yield.Each monthly regional model normally has oneindependent variable X.

The form of the regional linear model is either

23

Page 28: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

y = U + ~x + EMaturity Adjustment

or

where

Y = average regional yield and

a, tj's are unknown model parameters.

X is the known independent variable, and

E is the difference between Y and its expectedvalue.

In the examples used in this paper, the soybean modelhas the quadratic term while the com model is limitedto the linear term.

While NASS conducts the survey during the last tendays of each month, the overall maturity of the crop atthat time will vary from year to year, depending onwhen it was planted, subsequent weather, etc. Theforecasting power of the model is enhanced byclassifying each plot by stage of maturity and limitingthe independent variable calculations to data from pre-selected stages. This adjustment allows the independentvariables to be more comparable across years.Variables not used directly in X (such as nodes andblooms, dried flowers and pods) are used for maturityclassification. Consequently, the predictor variable isnot a function of all the data, but only those plots in astage that has exhibited good predictive power for finalyield. This criteria normally means the exclusion ofvery immature samples in the first month of the survey.After that the vast majority of the samples are useddirectly in X.

Relationship of the nwnber and weight variables tofinal average yield

A plot of the data in the September I com regionalmodel is shown below. (The digits plotted represent theyears 1980-1992).

As mentioned above, the survey variables are selectedto correlate with the components of final yield, whichare number of ears or pods and weight per ear or pod.H is quite illuminating to 1iew the 3-dimensionaldistribution of final yield and the factors of theindependent variables to see how they explain the yield

•....• -li2

2.9 3.0 3.1 32 3.3 3.4 3.5 3.6e.. x Kernel Row L.-.gth

September 1 Com Model

140

130

Y120I

•I 110d

100

CORN VARIABLES BY MONTH

(stalks with ears + ears withAugust kernels per acre) X (average kernel

row length per ear)

September (Ears with kernels per acre) X(average kernel row length per ear)

October- (Ears with kernels per acre) XDecember (average grain weight per ear)

SOYBEAN VARIABLES BY MONTH

August estimated nwnber of lateralbranches per acre

September Estimated number of pods withbeans per acre

October- (estimated number of pods perDecember acre) X (net weight per pod)

The values for X for com and soybeans are shown inthe following tables.

24

Page 29: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

level. Since the independent variable in the model canusually be factored into the product of a variablecorrelated with final weight and one correlated withfinal number, we can plot the fitted model surface overthe weight X number plane. The projection of selectedlevels of the fitted yield surface onto this plane is easierto analyze. An October example for soybeans and aNovember one for com are shown below.

each other to produce approximately the same yield,even though rl1e weight and number variables arevarying quite widely. The heaviest average weightoccurred in 1985, but it had drought-like numbers ofpods. At the other extreme, 1987 bad the lowestOctober 1 weight of the normal years, but its podcounts were the second highest. 1992, which has therecord yield to date, bad the highest number of pods onOctober 1.

Soybeans - October 1, 1983-1992

This graph contains a great deal of information aboutsoybean yields. The years divide into two distinct

Ill'

222t2

•• __ ,•• -1'1 -In

,,,,,,,,

"14.', " •.',.,,,

13.

Com - November I, 1980 - 1992

0.27318113

0.303

.0.362• eo"t

0.332E

•r

Since the surface is based on a model with a quadraticterm, one can see the spacing between the contoursincreases as the yield level increases. This implies thatthere are diminishing increases in yield as the averageweight and numbers increase. Also, since the contoursare at roughly 45 degree angles, one can deduce thatincreases in weight or numbers will increase yield.However, this is survey data, and numbers and weightdo not vary independently (they vary inversely) so anincrease in one will normally be associated with adecrease in the other and vice versa.",.,,,,,

"""","",

III

It -+- II --JI - It

1239 1327 1414 1501

Pods per 18 •• ft

13.

" ...'..

0.3800

0.31101152

p 0.3485•r

w•I 0.3673•Itt

••• 0.3297 ~d

For soybeans, the weight per pod is in grams, and theyield contours projected from the fitted model surfaceonto the plane are 27, 30, 33, 36 and 39 bushels peracre. On October I, usually about balf the crop hasbeen harvested, and the weight is for just thoseharvested samples. The pods per 18 square feet is forall samples as of October 1.

groups, with 1983, 1984 and 1988 in the lower leftcorner, and the remaining years distributed along the36 to 39 bushel contour region. The years 1983, 1984and 1988 were severe drought years in the com belt,and both pod counts and weight were depressed to thepoint that yields averaged around 30 bushels. In theremaining years, conditions were more normal, andaverage yields were generally around 36 to 39. So farthere has not been a year where weight and numbers ofpods were simultaneously near record levels. There isan obvious negative correlation between average weightand number. The two variables interact inversely with

For com, u~y about two-thirds of the crop isharvested by November 1. The grain weight, inpounds, is jU';t for the harvested samples. The earcounts are for all samples as of November 1. Theprojected yield contours from the fitted surface are 78,88, 98, 108, 118, 128 and 138 bushels per acre.

Here we see the two drought years, 1983 and 1988, inthe lower left comer . Th~re appears to be lessdependence b<~tweenthe weight and number variablesfor corn than there was with soybeans. Some years,

25

Page 30: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

such as 1985, 1986 and 1987 are pushing the limit onboth ears and weight. In 1992, ear density increaseddramatically, while the ear weights mllintllined anaverage level for non~ought years. 1992 set a newrecord for yield by a large margin, driven by the largeear counts.

Since the com model has no quadratic term, the spacingbetween the contour levels is constant. The 45 degree

. contours indicate both weight and numbers drive finalaverage yield~ If conditions are generally good, it ispossible to have both large ear counts and aboveaverage weights in the same year, something that is notgenerally seen with soybeans.

Conclusion

Average com and soybeans yields can be predicted byobserving variables that are correlated with finalnumbers and weights. In com both counts and weightscan be high at the same time, producing record yields.With soybeans, however, final counts and weights areinversely related, producing relatively constant averageyields in non~ought years.

References

Birkett, T.R. (1990), wThe New Objective YieldModels for Com and SoybeansW

, National AgriculturalStatistics Service, 5MB-90~, Washington, DC, 20250.

SAS Institute Inc., S{\S/GRAPH Software: Reference,Version 6, First Edition, Volume 1 and 2, Cary, NC:SAS Institute, Inc., 1990.

Searle, S.R. (1971), Linear Models, New York: JohnWiley & Sons, Inc.

The author can be contacted at

NASS/USDA, Room 4813,14th and Independence, SW,Washington,DC, 20250

202-720-5359

26

Page 31: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

USING DIFFERENT PRECIPITATION TERMS TO FORECAST CORN AND SOYBEAN YIELDS

M. Denice McConnickUSDA/NASS/Research Division/3251 Old Lee Hwy., Room 305, Fairfax, VA. 22030

KEY WORDS: Precipitation, regression models method was also used to aggregate the precipitationvariables.

INTRODUCTIONDATA

where

Precipitation Data

Precipitation variables used in the modelsrepresent total precipitation for a particular month at theregional level. The data are provided from a networkof National Weather Service weather stations in eachState. The variable is constructed as follows:

the acres for harvest for year t,State s, district d, andthe number of districts per State s,the average total precipitationwithin selected month for year t, States, district d,

sEA~u

P = ,-I (1), sEAcs,:1

the average total precipitation withinselected month for the region for yeart,the number of States covered,the acres for harvest for year t,State s, andthe average total precipitation withinselected month for year t, State s,

where

D.Eud

In 1990, the National Agricultural StatisticsService (NASS) introduced new models to forecast yieldfor com and soybeans on the regional and State levelsin a plan to phase out the older, less accurate models(Birkett 1990). An annual survey collects data fromrandomly selected sample plots in randomly selectedfields. The old regression models predicted thecomponents of yield such as number of pods per plantand weight per pod at the plot level based on five yearsof previous data. Plot level data were then aggregatedto the State level. The new models are also regressionmodels, and have initially been developed to predictyield directly rather than the components of yield usingsurvey data aggregated to the regional level. Regionsare constructed from the set all States that participate inthe annual survey. A longer period of years in thehistoric data set must be used since only one data pointis used to represent each year.

McCormick and Birkett (1992) tried to improvethe accuracy of early season soybean yield forecasts byadding a term that represented total accumulatedprecipitation throughout the growing season from April1 until the forecast date at a six~State regional level.The analysis indicated that soybean forecast accuracy atthe regional level was not improved using this particularterm. Based upon this result, two recommendationswere made. One was to evaluate alternative time frameterms, such as monthly precipitation totals. The otherwas to use them to forecast other major agriculturalcrop yields. This paper reports results when separatemonthly precipitation terms were added to corn andsoybean yield forecast models. It considers data forthirteen years, 1980 to 1992. The soybean Statesincluded in the study are Arkansas, Illinois, Indiana,Iowa, Missouri, Minnesota, Nebraska, and Ohio. Thecorn States are Illinois, Indiana, Iowa, Michigan,Minnesota, Missouri, Nebraska, Ohio, South Dakota,and Wisconsin. The performance of each model iscompared to official operational model performance.

This study evaluates multiple regression modelswhich use precipitation and survey variables to forecastend-of-season crop yields. In previous research, themodels showed improved performance using aggregatedsurvey variables at the regional level. Therefore, this

27

Page 32: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

where

number of weather stations for year t,State s, district d, andtotal precipitation within selectedmonth for year t, State s, district d,weather station w.

Survey Data

average kernel row length per square foot. Cli issubstituted for F•• in equation (2). In August, it iscalculated as:

•••1 -C'" = - L (Utsj +Vrs)K",j ,m", J. 1

where

The construction of the independent variablesfor the regional regression models for both soybeansand com is discussed by Birkett (1990, 1993). Forsoybeans for the month of August, the independentvariable (Z.) is the estimated number of lateral branchesper eighteen square feet. For September, theindependent variable is the estimated number pods withbeans per eighteen square feet. These regional-levelestimates for soybeans are constructed as follows:

In September, Cli is calculated as:z =t

where

sL Atr F",

$ • 1

S

L A",$ ·1

(2)

C••

U••j

V••j =

~j =

a function of the number of stalks withears, the number of ears withkernels, and the average kernel rowlength per square foot,number of stalks with ears per sq. ft.,year t, State s, sample j,number of ears with kernels per sq.ft., year t, State s, sample j, andthe average kernel row length per ear,year t, State s, sample j.

the acres for harvest for year t,State s, andmlmber of lateral branches per 18 sq.feet year t, State s,

For both forecasts, data are used from thesubset of samples in maturity categories 3-6 for year t,State s.

Yield Data

The regional yield values included in this studywere calculated as follows:

Com independent variables (ZJ are morecomplex as they are a function of both plant counts and

Regression analysis was used to evaluate theperformance of precipitation data in combination withsurvey data. Multiple linear regression models withassociated diagnostics for model fit and forecast

sL A", Ytr

y = $·1 (3)t sLAcs

1·1

where

m.. =

J•• =

B••j =

~j =

the number of samples in J•• year t,State s,the subset of samples classified inmaturity categories 2-6 (or 1-6 insouthern States), year t, State s,plants per 18 square feet for year t,State s, sample j,lateral branches per plant year t,State s, sample j (for August) orestimated pods with beans perplant per 18 sq. feet, year t, State s,sample j (for September).

where

Y,Y••

final regional yield for year t, andNASS State yield year t, State s.

METHODOLOGY

28

Page 33: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

accuracy were examined. The basic regression modelsanalyzed were: 1

SD(Y.,)=s[(x,,'(X:X.,)-lX.,) + 1] 2,

Model 2 is the official model used by NASS toforecast August com and soybeans and Septembersoybeans. However, Modell is the official model usedto forecast September com. Models 3 and 4 use onemonthly precipitation term. Analysis was conducted todetermine which month from the growing seasonprovided optimal forecasting capability. Also, modelswith multiple monthly precipitation terms wereexamined.

Model Evaluation Criteria

s = (residual MSE)If2,XO relevant p-dimensional row

vector of independent variables for yearo (for example, in Model 3: p= 3,

Xo [1, ZO, PJ),x., = relevant (n-l x p) matrix of

independent variables (excludes xJ,n number of years, andp = number of parameters.

The Xo matrix excludes the row vector Xo, sothat the PI reflects the accuracy expected in anoperational model where current year data are notincluded in the model development. A significancelevel ofO.32 was used for this study, which provides tvalues near 1.0. Consequently, the future Y will fallwithin the calculated PI of the predicted Yapproximately 68% of the time.

2. R.2 is used as a goodness-of-fit test foreach model with an adjustment made for theconesponding degrees of freedom (Draper andSmith 1981).

The primary model evaluation criterium is theset of prediction intervals (PI) for the minimum,median, and maximum yielding years over 13 years inthe study. For soybeans, these years were 1988, 1981and 1990, and for com, they were 1983, 1989 and1992, respectively. A second criterium is the adjustedcoefficient of determination, R/ which provides ameasure of correspondence between predicted andactual yields. Both the PI and R/ are based on the sumof squared differences from the least squares analysisused to derive the model parameters.

Outlier Identification

R/ is calculated as:

1 __(RS_S~_f(_n_-_p_)(CTSs)f(n - 1)

the residual sum of squarestaking the changing Dumber ofparameters into account,the corrected total sum ofsquares,the number of years, andthe number of parameters.

crss

where

np

The prediction interval (PI) refers to half theconfidence interval length for the predictedvalue of a future Y for a given future year o.That is, at the a significance level,

1.

where

a 4

PI = t(1-"2;n-l-p)SD(Y.,), Since the purpose of the models is to makeforecasts, lhe rstudent statistic (also called thestudentized residual) was used to help identify outliersto be excluded from the model. This statistic wasrecommendt~ in Belsley, Kuh and Welsh (1980). It issimilar to the standardized residual:

29

Page 34: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

RESULTS

Here, s is replaced by s(i). S(i) is the estimateof a with the ilh observation deleted. In aforecasting model, rstudent measures how manyprediction standard errors the forecast is from theobserved Y. Observations with absolute values ofrstudent greater than 3.0 were identified as outliers.The rstudent statistic is distributed closely to the t-distribution with n-p-l degrees of freedom.

Regression analysis was conducted on anumber of different models using different monthlyprecipitation terms. Tables 1 and 2 present theprediction intervals and R/ for the official linear orquadratic model using survey data only and then resultsadding the optimal monthly precipitation term. In bothtables, the prediction intervals relate to the years withminimum, median, and maximum regional yields.

Table 2: September ResultsModel R,,2 Prediction

Intervalsmin med max

CORN:Official .97 3.6 3.2 3.4PI=June .98 2.3 2.0 2.2

SOYBEANS:Official .89 1.7 1.6 1.6PI= August .88 1.9 1.7 1.9

BmLlOGRAPHY

Note: September corn: Official model removed 1990; Precip model

removed 1988.

CONCLUSIONS

Except for the September soybean forecast, theprecipitation models performed better than the officialforecast models since their prediction intervals wereconsistently smaller. Contrary to previous indications,the August forecast models demonstrated that theaddition of a monthly precipitation term with a surveyterm does improve forecasts for both crops. For bothperiods, the com forecast seemed to benefit thegreatest. There is no evidence that a change from theofficial model is warranted for September soybeans.

,.Irsi =

ilh residual,(residual MSE)I/2, and

'(X'X)-:Xi Xi'

where

r·1s

Belsley, David A, Kuh, Edwin, Welsh, R.E., (1980),Regression Diagnostics, John Wiley & Sons.

Table 1: August ResultsModel R,,2 Prediction

Intervalsmin med max

CORN:Official .87 7.0 5.7 6.2P,=July .93 5.4 4.3 4.9

SOYBEANS:Official .70 2.8 2.3 2.7P,=July .74 2.3 2.1 2.3

Note: August corn: both models have outlier year 1988 removed.

Birkett, Thomas R., (1990) "The New Objective YieldModels for Com and Soybeans", 5MB Staff ReportNumber 5MB-90-02, USDA.

Birkett, Thomas R., (1993) "Yield Models for Com andSoybeans Based on Survey Data", USDA, Proceedings,ICES Conference, Buffalo, New York.

Draper, N.R., Smith, H., (1981), Applied RegressionAnalvsis, John Wiley & Sons Second Edition.

McCormick, M. Denice, Birkett, Thomas R. (1992)"Evaluating the Addition of Weather Data to SurveyData to Forecast Soybean Yields", SRB ResearchReport No. SRB 92-11, USDA.

30

Page 35: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

AN EMPIRICAL INVESTIGATION OF AN ALTERNATrVE NONRESPONSE MODELFOR THE ESTIMATION OF HOG TOTALS

Matt Fetter, National Agricultural Statistics ServiceUSDA/NASS/Research Division/3251 Old Lee Hwy., Room 305, Fairfax, VA. 22030

KEY WORDS: Nonresponse model, reweighting, bias

1. THE SURVEY

The National Agriculture Statistics Service (NASS)conducts the Quarterly Agriculture Survey (QAS) tocollect data on cropland acreage, grain storage, andvarious livestock items including hogs. The QAS isa "multiple frame" survey. Two independent framesare sampled, the "list" frame and the "area" frame.The "list" frame is a list of farn1 operations acrossthe U.S. that NASS maintains. The "area" frame iscomposed of all land in the contiguous U.S. TheQAS estimate for an item of interest is constructedby adding the estimate obtained from the list framesample with the estimate obtained from the farmoperations in the area frame sample that are not onthe list frame. All estimators of interest in thispaper are list frame estimators, thus the area frameportion of the multiple frame estimate will not bediscussed further.

The QAS list frame is sampled using a stratifiedsimple random sample design. Stratification isbased primarily on each unit's control data for hogs,grain storage capacity, and acreage. A priorityscheme is used to place each unit into exactly onestratum. The resulting stratification is not optimalfor anyone particular item of interest. For example,one stratum might be composed of units havingsimilar grain storage capacity, but their hogcharacteristics might be quite different. Anotherstratum might be composed of units having similarhog characteristics but very different croplandacreage, and so forth.

Total nonresponse for QAS list frame samplestypically range from 10 to 20 percent. Even fortotal nonrespondents there is often someinformation known about the sampled unit. Forexample, the interviewer or enumerator may be ableto determine that the sampled unit is in business.Sometimes the presence or absence of hogs can bedetermined even though the actual number of hogsmay be unknown. This partial information can beused to reduce nonresponse bias.

31

With the exception of certain self-representingstrata, NASS currently uses sampling weightadjustment procedures (reweighting) to reducenonresponse bias in its estimate of list frame hogtotals. Two different estimators are used to modelnonresponse. The first assumes that thenonrespondents can be reasonably well representedby the respondents. This is a strong assumptionand its validity is seriously questioned. Theestimator that is based on this assumption is not ofsignificant ::nterest and will not be formallydiscussed here. The other estimator is based on amodel that uses the hog presence/absenceinformation that is available on somenonrespondents. This estimator will be referred toas the Adjusted Estimator.

The Adjusted Estimator, developed by Crank (1979),was designed to take advantage of all partialinformation that was available on nonrespondents.At that time, the QAS was only designed to captureinformation regarding the presence or absence ofhogs for nonrespondents. Currently, the QAScaptures information regarding in/out of businessstatus (ag-status) for nonresponding units. Cox(1993) described an alternative estimator thatincorporates this additional information into thenonresponse model. The purpose of this paper is todescribe this alternative estimator (referred tohenceforth as the Revised Estimator) and toinvestigate the effect it has on the level of theestimates produced by the Adjusted Estimator.

The Revised Estimator, was applied to historic dataso that a direct comparison of the two estimatorscould be made. Five major hog producing states(Georgia, Illinois, Indiana, Iowa, and NorthCarolina) were chosen for this purpose. TheRevised Esti:nator was applied to 15 consecutiveQAS surveys (June 88 - December 91) for eachstate. A comparison of the two estimates could thenbe made for each state in each quarter.

Page 36: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

2. WEIGHTING CELLFORMATION

A weighting cell is defined as a group of sampledunits within which nonresponse adjustments arecomputed and applied to the sampling weights. Ifthe propensity to respond is linked to certain hogcharacteristics of the sampled units, it is desirablethat weighting cells be composed of units that aresimilar in these characteristics. Under theseconditions, all units within a weighting cell wouldbe equally likely to respond. Thus the respondentswould be representative of the nonrespondents andnonresponse bias would be minimal.

For the Adjusted Estimator, the weighting cells arethe design strata. Because the stratification of thelist frame is not optimal for hog estimation, designstrata are not the most efficient cells for computingand applying nonresponse adjustments. Thusrespondents are less likely to be representative ofthe nonrespondents within these cells. Through theuse of poststratification, it is possible that improvedweighting cells can be defined.

3. THE NONRESPONSE MODELS

In order to claim that a reweighted estimator isunbiased in the presence of nonresponse, someassumptions must be made about thenonrespondents. If all other factors are consideredequal, the estimator based on the most sound set ofassumptions would be judged as the estimator ofchoice for the reduction of nonresponse bias.

When considering the form of these estimators, itwill be helpful to think of the estimation procedureas consisting of a sequence of three specific steps.For each sampled unit, three determinations need tobe made. These are:

1) the sampled unit's status as an agriculturaloperation (ag-status). [(Is the unit in business orout of business)? This determination is onlyapplicable in the case of the Revised Estimator.]

2) the sampled unit's status as a hog operation(hog-status). (Does the sampled unit raise hogs ornot)?

3) the sampled unit's status as a hog-totalrespondent (hog-total status). Os the number ofhogs associated with the sampled unit known)?

A complete respondent will be defined as a sampled

32

unit for which the number of hogs associated withthat unit is known. A nonrespondent will bedefmed as any sampled unit for which anyone ofthe above determinations can not be made.

In order to compare the nonresponse modelsimplied by the estimators considered here, theunderlying assumptions must be understood. Ateach modeled level of nonresponse, a validassumption concerning the nonrespondents isrequired to claim that the estimator is unbiased inthe presence of nonresponse.

The Adjusted Estimator adjusts for nonresponse attwo levels, the hog-status level and the hog-totalstatus level. Therefore, one assumption concerningthe nonrespondents at each level must be valid. Forthe hog-status level, the required assumption is:

AssumPtion lA. The probability that hog-status willbe determined is the same for all sampled units ina particular stratum. This implies that hog-statusnonrespondents represent a simple random sampleof the stratum population.

For the hog-total status level the requiredassumption is:

AssumPtion 2A. Within a stratum, amongst all unitswhich have been determined to be hog operations,the probability that the number of hogs associatedwith that unit will be obtained is the same for eachunit. This implies that within a stratum, hogoperations that are complete respondents representa simple random sample of all sampled units whichhave been determined to be hog operations.

If N(h) represents the stratum h population size andn(h) represents the stratum h sample size, theAdjusted Estimator can be expressed in thefollowing form at the stratum level:

2 n(he)

Y(h) - WSGIIIp(h)Alwl_ih) L Ahol_uihe) L y(hel).-1 i-I

(1)

where:

Y(h) represents the estimated number of hogs instratum h.

Wsamp(h)= N(h) / n(h).

Page 37: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Ahog_sr(h)= n(h) / nhog-srresp(h), the hog-statusnonresponse adjusnnent for stratum h,

where:

nhog-srresp(h) represents the number of hog-statusrespondents in stratum h.

Ahog_wr(he)= nhog-srresp(he) / ncomp-resp(he), thehog-total status nonresponseadjustment for weighting class e instratum h,

where:

nhog-srresp(he) represents the number of hog-statusrespondents in weighting class ewithin strataum hand,

ncomp_resp(he)represents the number of completerespondents in weighting class ewithin stratum h.

y(hei) represents the number of hogs reported bycomplete respondent i in weighting class ewithin stratum h.

n(he) represents the number of units in class e instratum h.

The subscript e denotes two distinct sets (classes) ofhog-status respondents in stratum h; hog operationsand non-hog operations. Once a sampled unit isidentified as a non-hog unit, the number of hogsassociated with that unit is immediately known tobe zero. Thus all identified non-hog units arecomplete respondents. Let e= 1 denote this class.For this class, there is no nonresponse at the hog-total status level. Thus:

Ahog_ror(h1)= 1 since:nhog-srresp(h1) = ncomp_resp(hl).

For the hog operation units (e=2), Ahog_wr(h2)mustbe expressed in the general fom1 stated above.

The Revised Estimator adjusts for nonresponse atthree levels, the additional level being the ag-statuslevel. For each of the three levels, one validassumption is required for the estimator to beunbiased. These assumptions are:

status nonrespondents can be thought of as arandom sample of the cell population.

Assumption 2R. Within a particular weighting cellcomposed of identified ag-operations, the probabilitythat hog-status will be determined is the same forall units comprising that cell. This implies that hog·status nonrespondents can be thought of as arandom sample of the units composing the cell.

AssumPtion 3R. Within a particular weighting cellcomposed of identified hog operations, theprobability that hog-total status will be determinedis the same for each unit in that cell. This impliesthat the hog-total status nonrespondents can bethought of as a random sample of the unitscomposing the cell.

The assumprions on which these estimators arebased are likely to be invalid unless the weightingcells are judiciously defined. In order to increase thelikely validity of the underlying assumptions of theRevised Estimator, it was desirable to define theweighting cells in such a way that they would becomposed of units having similar hogcharacteristics. Poststratification based on eachunit's hog control data was used to form weightingcells. Thus the weighting cells were definedsimilarly to the way that design strata would bedefined for a hog-specific survey. In order to furtherincrease efficiency, the weighting cells (post-strata)were defined to insure that approximately 20complete respondents would be contained in eachcell. (The Adjusted estimator is not implemented insuch a way as to insure reasonably high numbers ofcomplete respondents).

Because the weighting cells cut across design strata,the Revised Estimator will be expressed at the finalnonresponse adjustment cell level, e, e= I, ...,E. Thegeneral form of the Revised Estimator is:

n.

Y(e)-Ahog_ro,(e) L Wsamp(el) Aps(ez)i

(2)where:

AssumPtion 1R. The probability that ag-status willbe determined is the same for all sampled units ina particular weighting cell. This implies that ag-

33

Y(e) represents the estimate of the total forhog-total status weighting cell e,

Page 38: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

y(ei) represents the number of hogs reported byunit i in weighting cell e.

ne represents the number of sampled unitsin weighting cell e,

Wsamp(ei) represents the sampling weight for theith unit in weighting cell e,

represents the poststratificationadjustment for the ith unit in weightingcell e,

~g_S[(ei) represents the ag-status nonresponseadjustment for the ith unit in weightingcell e,

Ahog_st(ei) represents the hog-status nonresponseadjustment for the ith unit inweighting cell e, and

Ahog_to[(e) represents the hog-total statusnonresponse adjustment for the ith unitin weighting cell e. (Note all hog-totalstatus respondents have the same hog-total status adjustment within class e).

All of the nonresponse adjustments have the usualform:

L W·sampall :sam.pW Wlia € etU

L W· sampall responding Wlia € ct!U

where W* represents the sampling weight or anadjusted sampling weight, depending on the level ofthe adjustment. All nonrespondents have anonresponse adjustment of zero by definition.

The poststratification adjustment has the followingform:

A (el) _ N(g)Ps L Wsamp(gl)

i€g

where N(g) represents the number of units on thelist frame that fall in poststratum g and Wsamp(gi)is the sampling weight for the ith sampled unit inpoststratum g.

34

4. THE VALIDITYOF THE ASSUMPTIONS

Although the assumptions implied by the AdjustedEstimator are reasonable, they are not beyondjustifiable criticism. As stated earlier, assumption1A asserts that within a stratum, all sampled unitsare equally likely to be hog-status respondents.However, if the partial information concerning ag-status is considered valid, then the original samplecan be divided into three mutually exclusive groups:1) those units for which ag-status is not determined,2) those units identified as non-ag units, and, 3)those units identified as ag units. All units in thefirst group have a zero probability of having hog-status determined because hog-status determinationimplies ag-status determination. Clearly, hog-statusdetermination is certain for all units in the secondgroup because all non-ag units have zero hogs.Therefore, one could argue that it would bedesirable to augment the nonresponse model so thatthe probability of determining ag-status is the samefor all sampled units, while the probability ofdetermining hog-status is the same for all sampledunits which are known to be ag-operations. If valid,this argument would imply that the AdjustedEstimator is based on a misspecified model.

If the Adjusted Estimator is based on a misspecifiednonresponse model, it is of interest to understandthe effect that this misspecification is having on theestimates of hog totals. First, an argument for thenature of the misspecification will be presented.Second, the effect of this misspecification on thelevel of the estimate will be described.

All ag-status nonrespondents are either: 1) non-agunits (out of business), 2) non-hog ag-operations, or3) hog operations. Because every unit in thepopulation must be one of these types, it isreasonable to assume that ag-status nonrespondentsrepresent a random sample of the cell (stratum)population. However, the Adjusted Estimator isbased on the stronger assumption that the hog-status nonrespondents as a whole represent arandom sample of the cell (stratum) population (seefigure 1). For a moment, let us assume that thisassumption is valid. If we adopt as a premise thata subset of this set-- ag-status nonrespondents,represents a random sample of the cell population,then the compliment of this subset-- identified ag-operations that are hog-status nonrespondents, mustalso represent a random sample of the cellpopulation. It will now be argued that the AdjustedEstimator's assumption is not reasonable under the

Page 39: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

adopted premise.

ILLUSTRATION OF COMPONENTPROPORTIONSBASED ON ADJUSTED MODEL

(WITHIN STRATUM)

POPULATION

NON-HOG OPERATIONS

HOG-STATUSNONRESPONDENTS(RANDOM SAMPLE OF POPULATION)

Figure 1

ILLUSTRATION OFCOMPONENT PROPORTIONSBASED ON REVISED MODEL

(WITHIN WEIGHTING CELL)

POPULATION

I NON-AG I NON-HOG OPERATIONS IHOG IOPERATIONS

AG-STATUS NONRESPONDENTS(RANDOM SAMPLE FROM POPULATION)

I NON-AG INON-HOG OPERATIONSI~~~T10NS IHOG-STATUSNONRESPONDENTS(RANDOM SAMPLE FROM AG-OPS)

Figure 2

All identified ag-operations that are hog-statusnonrespondents must either be non-hog ag-operations or hog operations. Because non-agoperations are missing from this group (all non-agunits are hog-status respondents-- they are non hogunits), it is difficult to argue that identified ag-operations that are hog-status nonrespondents canbe thought of as a random sample of the cellpopulation (see figure 2). The effect of thismisspecification is to bias the estimate downward.This can be explained as follows:

Identified ag-operations that are hog-statusnonrespondents have only one source of zerosu non-

35

hog ag-operations, whereas ag-statusnonrespondents have two sources of zeros-- non-agunits and non-hog ag-operations (see figure 2). Ittherefore seems reasonable to assume that identifiedag-operations that are hog-status nonrespondentsare more likely to be hog operations than ag-statusnonrespondents. It is thus argued that the AdjustedEstimator essentially underestimates the proportionof hog operations in the population. It gives anunbiased estimate of this proportion for the ag-status nonrespondents but gives a downward biasedestimate for those identified ag-operations that arehog-status nonrespondents.

The Revised Estimator is based on the augmentedmodel refened to earlier. The difference betweenthe underlying models of the Revised and AdjustedEstimators is that the Adjusted Estimator models allhog-status nonrespondents the same way(Assumption lA). The Revised Estimator models ag-status nonrespondents as if they are a randomsample of the cell population (Assumption lR), andmodels identified ag-operations that are hog-statusnonrespondents as if they represent a randomsample of those units identified to be ag-operations(Assumption 2R).

Note that both estimators model identified hogoperations that are hog-total nonrespondents asthough they represent a random sample of thoserecords identified to be hog-operations.(Assumption 2A is essentially the same asassumption 3R.)

5. RESULTS AND CONCLUSIONS

The main focus of the research was to observe howestimates obtained from the Revised Estimatorwould compare to those produced by the AdjustedEstimator using historical QAS data flIes. Theobserved effect of applying the Revised Estimator tohistorical data is an increase in the estimated totalnumber of hogs. This supports the argument thatthe Adjusted Estimator is biased downwards. Theaverage percentage increase relative to the AdjustedEstimator ranged from a low of 0.64 percent inIowa to a high of 2.96 percent in Georgia. Acrossthe five states studied, the increase averaged 1.53percent over all quarters. There were severalquarters for which the Revised Estimator produceda lower estimate than the Adjusted Estimator. Thiswas not due to the nonresponse model, but wascaused by the poststratification crossing designstrata. The Revised Estimator tracked well with the

Page 40: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

other estimators for all states. Figure 3 shows therelationship between the estimators for Illinois.

The structure of this Revised Estimator is appealingbecause it provides separate assumptions for each ofthe three stages of nonresponse. A logical argumenthas been made that the distribution of thenonrespondent population is different between theag-status and hog-status stages. The assumptionsthat nonrespondents are random samples at eachstage serve as a reasonable baseline approach, butas yet have not been validated by empiricalevidence. Further study is needed to determine theappropriateness of these assumptions.

U1NOIS ESTIMATED UST HOG TOTALS

EStTOlAl......,

_ • _ ••• ~ a ~ • _ ~ •• ~

Figure 3

REFERENCES

[1] Agricultural Statistics Board (1991), "Hogs andPigs" (September 1991 Report), NationalAgricultural Statistics Service.

[2] Cox, B. G. (1993), "Weighting ClassAdjustments for Nonresponse in IntegratedSurveys: Framework for Hog Estimation,"National Agricultural Statistics Service.

[3] Cox, B. G. (Unpublished), "WeightingSurvey Data for Analysis," National AgriculturalStatistics Service.

36

[4] Crank, K. N. (1979), "The Use of PartialInformation to Adjust for Nonresponse,"Agriculture, Economics, Statistics, andCooperatives Service.

[5] Fuchs, D. R., and Bass, R. T. (1990), "1989Agricultural Surveys Survey AdministrationAnalysis," National Agricultural StatisticsService.

[6] Kort, P. S. (1990), "Nonresponse Adjustments inNASS Agricultural Surveys," NationalAgricultural Statistics Service.

[7] Kart, P. S. and Thorson, J. (1989), "ImprovingVariance Estimates for Livestock Surveys,"National Agricultural Statistics Service.

[8] Garribay, R. and Huffman, J. E. Jr. (1991),"1990 Agricultural Survey AdministrationAnalysis," National Agricultural StatisticsService.

Page 41: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

IDENTIFYING AND CLASSIFYING REASONS FOR NONRESPONSE ON THE1991 FARM COSTS AND RETURNS SURVEY

Terry P. O'Connor, USDAINASSResearch Division/32S1 Old Lee Hwy., Fairfax, VA 22030

ABSTRACT

A research study was conducted during the 1991 FarmCosts and Returns Survey (FCRS) to identify andclassify the reasons given to field interviewers bypotential respondents for refusing to participate in thesurvey. The reasons given by field interviewers forcoding a sampled unit as inaccessible during the surveywere also identified and classified.

The research was conducted in all 48 surveyed states,and included 6 FCRS questionnaire versions. Uponreceiving a refusal, interviewers were instructed torecord the reason given on the face page of thequestionnaire. If no reason was given, or in caseswhere more than one reason was given, theinterviewers were instructed to discuss the concerns ofthe respondent in regards to completing an interview,and identify the main reason for refusing. When asampled unit was coded as a inaccessible, interviewerswere instructed to explain the reason for theinaccessible.

During the survey statistician's manual edit of thequestionnaires, the reasons for refusal or inaccessiblewere reviewed and compared to a coded list of reasonsfor nonresponse compiled from previous research intothis topic on the FCRS. Statisticians could consider thecomments from the interviewers as a match to a pre-coded response, or add additional codes for uniquecomments.

The nonresponse rate on FCRS averages 30% per year.The reasons behind the nonresponse have been a sourceof speculation for many years, and previously onlyanecdotal evidence was available on which to baseefforts to maximize response. This research shows theanecdotal evidence to have been on the mark in somecases an off in others.

INTRODUCTION

The Farm Costs and Returns Survey (FCRS) is a faceto face interview survey conducted annually duringFebruary and March by the National AgriculturalStatistics Service (NASS). It is a survey of theagricultural sector, and is conducted in the 48

37

conterminous states to collect detailed information onfarm expenditures and income, costs of production anddemographic data. The FCRS has a multiple framedesign utilizing a list sample of medium and largeranches and farms, and an area nonoverlap sample ofResident Farm Operators (RFOs) not represented by thelist, most of whom operate small farms (Rutz, 1991).

While all 48 FCRS states utilize the same surveyprocedures. the FCRS includes several questionnaireversions used in different combinations across thecountry. Tht~ versions used in a particular state for agiven year depend upon the agriculture in that state andthe areas of agricultural specialization being studied.Costs of producing the various agricultural commoditiesare studied on a year-to-year rotating basis. There arevariations in geography, sample sizes, farm or ranchtypes and sizes, economic conditions and respondentattitudes about the survey across the country;therefore, many factors must be considered whenmaking direct state to state comparisons of the surveyresults (Rutz. 1991).

The 1991 FCRS national response rate was 67.9percent, with a refusal rate of 24.9 percent and aninaccessible rate of 7.2 percent. Response rates on thesurvey have declined slightly over time, despiteextensive efforts to limit nonresponse. While NASSuses farm expense data from the FCRS in its reports,the primary user of the FCRS dataset is the EconomicResearch Service (ERS), which utilizes all of the FCRSdata in producing economic analyses and cost ofproduction reports (Rutz, 1991).

A benefit of collecting this type of information is thatsurvey managers can make adjustments to the public'sperception of a too long interview by testing ashortened version of the questionnaire (as is beingplanned for the 1992 FCRS). Headquarters can preparematerials to aid survey statisticians in training theirinterviewers to meet the challenges of the refusal typescommon across states. Survey statisticians shoulddevelop materials for use in their state workshops toprepare interviewers for situations common to theirstate. Expenenced interviewers who have had successin converting refusals into respondents should sharetheir techniques through panel presentations or groupdiscussions In this way, interviewers will maximize

Page 42: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

response rates on the initial contact by being preparedto discuss concerns and grievances brought up by therespondents, thus avoiding the additional time andmoney costs of are-contact.

BACKGROUNDThe research project to identify and classifynonresponse on the FCRS stems from four years ofpreliminary work which the author completed while onstaff in the South Carolina and Indiana State StatisticalOffices (SSOs).

Beginning with the 1985 FCRS, the author required thatthe South Carolina interviewers document the reasonsgiven by respondents who refused to participate in thesurvey. Previously, interviewers were likely to simplywrite "refusal" across the questionnaire, and thecomments the interviewer received from a refusal werediscussed second or third hand if at all, and weresketchy at best.

Then on the 1986 FCRS, South Carolina was selectedas one of six states to take part in a refusal conversionresearch project. All respondents who refused toparticipate in the survey during the initial contact wereto be re-contacted with the purpose of convincing themto complete an interview. It was apparent thatinterviewers selected to re-contact a refusal in thecurrent survey had an advantage if they were aware ofthe reason the respondent gave when initially refusing.

The information on "reasons for refusing" gatheredduring 1985 were discussed during the trainingworkshop for the 1986 FCRS, and responses to thereasons were developed by the interviewers. Toprepare for the re-contact required by the research,interviewers were again required to write on thequestionnaire the exact reason or circumstances behindeach refusal received on the FCRS. In this way,subsequent interviewers were made aware of the eventsof the initial contact.

The primary benefit of identifying the refusal types wasthat the interviewers could PREP ARE for commonsituations before encountering them in interviewsituations. According to interviewer comments, thispreparation improved their confidence in approachinginterviews, and even when they could not prevent arefusal, they were able to set the stage for therespondent's cooperation on other upcoming surveys.The second benefit was that, when approaching a re-contact on the refusal conversion project, thesubsequent interviewer could prepare for a specific

38

situation. A third benefit was that interviewers (withtheir supervisor's approval) could eliminate re-contactsof certain refusal types (violent refusals, death in thefamily, etc.), saving money and time during the criticaldata collection period.

Perhaps because the refusal conversion project was newand received much attention, or perhaps because therefusal identification preparation worked, the FCRSresponse rate in South Carolina for 1986 was 17percent higher than in 1985 (Dillard, 1987). Theauthor attributes most of this increase to interviewerpreparation on the initial contact since only a smallnumber of refusal conversions were obtained.

Upon transferring to the Indiana SSO, the author againinstructed the field interviewers to document the reasonsgiven by refusals. While the refusal identification andinterviewer preparation led to an initial decrease from35 percent to 31 percent in the refusal rate in Indiana,no additional gains have been evident, with the refusalrate averaging 31 percent over the past five years. Thelist of refusal types compiled during this time served asthe basis of the refusal list utilized for the nonresponseidentification project on the 1990 FCRS.

This research was conducted during February andMarch, 1991. The six test states included two statesthat averaged high non response rates, two states thataveraged mid-level nonresponse rates, and two statesthat averaged low nonresponse rates on the FCRS.Comments from the FCRS post-survey evaluationscompleted by survey statisticians around the countryalluded to problems with certain refusal types, but withonly anecdotal information to support their impressions.Evaluations included the following comments:

* "Some farmers feel it's none of our business. "* "Many farm operators refused due to the

length of the questionnaire .•* "Most of the second time contacts were

refusals and didn't want to be contactedagain. "

Some ... many ... most. The 1990 FCRS nonresponseidentification project was expanded to all surveyedstates for 1991 in order to put some numbers on thesevalid concerns and to better determine what NASS is upagainst when trying to minimize FCRS nonresponse.

RESULTSThe results of the 1991 refusal identification andclassification research are listed in Appendix A.

Page 43: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Refusal types coded 01 - 53 were provided in thesurvey instructions; codes 200 - 409 were initially leftblank for state use, and states added refusal types basedupon their data collection experiences with the survey.

The most frequent reason given by the farmers whenrefusing to participate in the survey was "Would nottake the time I too busy". This response was given by1,395 of the 5,663 refusals encountered (24.6%), andwas recorded nearly twice as often as the next mostfrequent response. This seems to be strong evidencefor those involved with the survey who believe thatfarmers perceive the interview to take too long .

The second most frequent reason recorded was"Refused, but no reason given", mentioned 739 times,or 13.0 percent of the total refusals received. Thiscategory represents a difficult type of refusal to convertto a respondent: they just say NO. They mayunderstand what NASS is and its mission, and mayeven recognize the interviewer from previous contacts,but cut off any attempt at an interview before theirconcerns can be identified and addressed.

The third most frequent reason recorded was"Information too personal I none of your business",mentioned 508 times, or 9.0 percent of the totalrefusals received. Together these first three reasonsaccount for 46.7 percent of the total refusals received,and the top five reasons account for 58 percent, eventhough 52 different reasons for refusing were mentionedduring this research.

Refusal reasons mentioned as frequently and aswidespread as these five should be addressed on anational level. However, SSOs must review their statespecific data to determine which less frequentlymentioned reasons are important to their state.

This research also involved identifying and classifyingthe reasons given by an interviewer when coding asampled unit inaccessible, ShO\VI1in Appendix B.Inaccessible types coded 75 - 150 were provided in thesurvey instructions; codes 500 - 709 were initially leftblank for state use, and the states added inaccessibletypes based upon their data collection experiences withthe survey. While basically separate from the refusalidentification, certain respondent situations (such as"Family illness I death") could be coded either as arefusal, an inaccessible or a valid zero out-of-businessdepending upon the circumstances encountered.

One benefit of this research is that the number ofincomplete questionnaires, that is, those questionnaires

39

•• CIS • 11 •• lm&lS

1)4, v.td:I n::IC•• ,. ••• , too ~lX3,~tu.l"I:)~",*"CI5 ~klOPI'1ICnIl/n::ftlalyas~.111001d1kl....,.,./ldorado •.•.•••.Q6.ll'O~ •••.•• ~Ind~IU'I ••••••. "..'-'....,.

for which the respondent could not or would notprovide enough information for the interview to becompleted, is evident for the first time. For the 1991survey, 263 questionnaires were coded as incompleteand were not summarized. This amounts to 3.6 percentof the nonresponse, but is only 1.2 percent of the totalsurvey contacts.

The most frequent inaccessible reason recorded by theinterviewers was "Tried several times; could not reachanyone for an appointment. Just an extremely busyperson. ", given for 455 of the 1,653 inaccessiblesencountered (27.5 %). This is a surprising finding inlight of the six week data collection period.

The second most frequent inaccessible reason recordedwas "Illness .I death in the family prevents the operatorfrom responding", mentioned 182 times, representing11.0 percent of the total. This is a difficult situationfor an interviewer to encounter, and setting the stage tosee a respondent under better circumstances in thefuture is the best that can be accomplished.

The third most frequent reason recorded was "Farmrecords are not available until after the survey periodcloses", menlioned 172 times, representing 10.4 percentof the total. Together these first three reasons accountfor 48.9 percent of the total inaccessibles recorded,with 23 different reasons for coding an inaccessiblementioned during this research.

SSOs must review their state specific data to determine

Page 44: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

which additional reasons are important to their state.For instance, "The operator is away on an extendedvacation", normally thought to be a Midwest orNorthern situation for escaping the snow, was alsomentioned in California, Florida and other warmweather states.

DISCUSSION AND RECOMMENDATIONS

Data analysts, survey managers, statisticians andinterviewers are concerned about the levels ofnonresponse on the FCRS. Being close to the survey,they develop impressions about what factors are"driving" the nonresponse. The purpose of thisresearch is to identify the reasons for nonresponse, andto attach some numbers to them in order to rank theirrelative importance. Considering the nature of theFCRS, that it is a long, detailed interview of arespondent's operating procedures, income andexpenses, assets and liabilities and demographicinformation, many survey organizations would bethrilled to have a national response rate exceeding 70percent. Rather than defend this position, the surveymanagers at NASS and ERS continually strive toimprove the response rate on the survey.

Following a discussion of the preliminary results of thisstudy and from previous consideration of the subject,NASS and ERS have agreed to test a shortened versionof the questionnaire for the 1992 survey year. Adetailed discussion of the benefits of a shortenedquestionnaire version can be found in Dillard (1991).NASS will provide training and materials to the surveystatisticians at the regional workshops in January, 1993,to aid in training their field interviewers during stateworkshops. Additionally, the information is useful inthe weighting of survey results and summarization.

According to Turner (1992) the FCRS nonresponseadjustment factor is based on an assumption that allnonrespondents are operating farms; that is, theywould provide positive records if interviewed. Miss-coding valid zero reports as nonrespondents willpositively bias the expanded indications. Turner (1992)states that, "Identifying these (nonresponse) reasons willenable enumerators to improve classification of caseswhere no farm appears to exist as a valid zero.Continued emphasis should be given to classifying onlypositives as refusals and inaccessibles. Thosenonrespondents that have no indication of being mbusiness should be coded as out of business. "

Look at the pattern of nonresponse across the datacollection period, and an interesting picture appears. In

40

five of the seven survey weeks, more refusals occurredon Mondays than on any other single day, and duringthe other two weeks, the number of Monday refusals isnear the peak for the week. This is probably a functionof more interviews being attempted on Mondays, but itmay also indicate that Mondays are not the best day toattempt a long interview without a prior appointment.Otherwise, the distribution of refusals seems normallyspread throughout the survey period.

As might be expected, the number of inaccessiblespeaks near the end of the data collection period whentime constraints force the interviewers to begin to giveup on respondents who either cannot be located or whocontinue to put off the interview when contacted. Ingeneral, the incomplete interviews seem normallyspread throughout the data collection period.

As the results from the six test states in the 1990research served as an excellent predictor of the 1991results, there does not appear to be enough yearlyvariation to justify transferring this research into anoperational aspect of the survey. I recommend that thisresearch be repeated in three years. In this way, eachSSO can be updated on the causes of nonresponse likelyto be encountered, and patterns of nonresponse can becompared.

REFERENCES

Dillard, Dave and T. Gregory (1987). 1986 FCRSAnalysis. Report II. Response Rates. Interview Times.and Data Collection Costs. United States Departmentof Agriculture, National Agricultural Statistics Service.

Gregory, Thomas L. (1990). 1989 Farm Costs andReturns Survey. Survey Administration Analvsis.United States Department of Agriculture, NationalAgricultural Statistics Service, Agricultural StatisticsBoard, NASS Staff Report 5MB Number 90-03.

Rutz, Jack L. and C.L. Cadwallader (1991). 1990Farm Costs and Returns Survey. Survey AdministrationAnalysis. United States Department of Agriculture,National Agricultural Statistics Service, Research andApplications Division, NASS Staff Report 5MBNumber 91-04.

Turner, Kay (1992). Modification of FCRSNonresponse Adiustment Procedures. United StatesDepartment of Agriculture, National AgriculturalStatistics Service, Research Division, SRB ResearchReport Number SRB-92-08.

Page 45: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

APPENDIX A: Reasons Given By Respondents When Refusing To Participate on the 1991 Farm Costs and ReturnsSurvey, All States and Versions Combined.

FREQUENCY1,395

73950833231325525319513513412812012010597958972645856484642403636302922181855422221111111111111

CODE04.03.05.11.06.02.10.34.20.12.18.16.17.21.19.07.01.32.24.52.27.28.23.08.22.13.26.25.14.29.09.53.

365.366.15.

240.260.265.267.215.250.255.256.257.258.262.269.270.335.340.341.367.

REASONWould not take the time / too busy.Refused, but no reason given.Information too personal / none of your business."I do not like surveys / I do not do surveys."The respondent feels that surveys and reports hurt the farmer more than help.Contact attempted, but respondent refuses on all surveys, and refused on this one."I will have nothing to do with the Government."Respondent will do other surveys, but not financial surveys.Family illness / death.Respondent only does compulsory surveys.The respondent feels the operation's records are inadequate to complete the interview."My farm is too small to count / too small to be representative.""You contact me too often. "Operator would not keep appointments.Farm records are at the tax advisors / lawyers."I did this survey before, but not again."Known refusal, no contact attempted."This is not a farm."Violent f threatening refusals.Questionnaire not sent to the field to avoid jeopardizing cooperation on other surveys.Respondent is quitting farming.Out of business now, will not answer for the previous year.Wants to be paid for interview time and effort."I just did a different survey for your office."Spouse / secretary / etc. will not let the enumerator see the operator.The respondent does not think the information is kept confidential.Respondent does not want to report due to legal/financial problems.Respondent does not want to talk about farming.The respondent mentions a specific grievance with the SSO or NASS (other than confidentiality).Figures for the previous year were not typical."I just dId a survey for someone else."Would not answer the door even though they were home.Operator called the office after receiving the pre-survey letter, asked not to be contacted.The operator does not believe in statistics, so wili not complete an interview.The respondent mentions a specific grievance with the state cooperator.Needed partner to provide some information; partner refused.Getting divorced, too upset to respond.Operator has a grievance with the IRS.Fed up.Water rights curtailed, will not cooperate."The government is broke, how can we afford to send these people out?"NASS data is not accurate. Too political.Doing well financially -- does not want to respond.Operator has several operations and could not separate records for the sampled unit.Upset with the government -- has to spend $20,000 to dig up fuel tanks.Farmhouse and records lost in a fire, January, 1992.This survey is not needed.Responded previously on this survey, and asked to be excused this year.The respondent feels the operation is too complex for our survey.The respondent has a specitic grievance with ASCS.The farm operation is in a blind trust for a natior.al politician.His father would not do surveys, so neither will the son.

5,663 Total Responses* Code numbers not listed were not used.

41

Page 46: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

APPENDIX B: Reasons Given By Enumerators When Coding a Sample Unit as Inaccessible/Incomplete onthe 1991 Farm Costs and Returns Survey, All States and Versions Combined.

FREQUENCY CODE455 116.

263 150.

182 84.172 85.169 86.142 79.80 81.67 80.27 76.26 94.18 82.12 75.9 83.7 78.7 87.

5 591.3 667.2 92.2 119.

1 120.1 540.1 561.1 565.1 580.

REASONTried several times; could not reach anyone for an appointment. Just an extremely busyperson.INCOMPLETE -- Respondent provided partial information, but would not or could notprovide enough information to make the questionnaire complete.IlIness / death in the family prevents the operator from responding.Farm records are not available until after the survey period closes.Respondent postponed the interview beyond the end of the survey period.The operator is away on an extended vacation.The operator is away on business.The operator is away on a brief vacation.No respondent, as listed on the label, could be found.Inaccessible, but no reason given.The address on the label is summer-seasonal housing.No operation, as listed on the label, could be found.Access to the address on the label was denied by a gate I guard / etc.The address on the label is vacant / burned out / no structure exists.Enumerator workload prevented this operation from being contacted during the surveyperiod.The operator moved away during 1991.The questionnaire was returned too late to be included in the summary.Non-English speaking respondent; interpreter not available.Enumerator mistake; caught it too late to complete an interview within the surveyperiod.Operator has several operations and could not separate records for the sampled unit.Questionnaire from the enumerator lost in the mail.Operator had just gotten out of jail and would not talk with anyone from the government.Enumerator did not contact sufficiently; gave up too soon.Enumerator error, should not have collected the data.

1,653 Total Responses

* Code numbers not listed were not used.

42

Page 47: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

AN EVALUATION OF NONRESPONSE ADJUSTMENT WITHIN WEIGHTING CLASS CELLSFOR THE FARM COSTS AND RETURNS SURVEY

Kay Turner, USDA/NASSResearch Division Room 305, 3251 Old Lee Hwy Fairfax, VA 22030

INTRODUCTION and OBJECTIVES

The Farm Costs and Returns Survey (FCRS) isconducted by the National Agricultural Statistics Service(NASS) during February and March of each year. Thedata are collected in the 48 contiguous States from farmoperators/managers for the preceding year via personalinterviews. Various versions of the FCRS collectdetailed and aggregate expenditure, income, asset,liability and cost of production data. The data from theFCRS are used to ascertain the financial status of theagriculture sector by supplying information such as:farmers' net income, costs of producing commodities,financial situation of farm operators, debt held by farmoperators, and importance of production expense items.Farm organizations, agribusinesses, Congress, theDepartment of Agriculture, farmers, and ranchers aresome of the groups that utilize FCRS data (NASS,1989). Each year a sample is drawn for the FCRSusing both list and area frames. The list frame includesmainly large and specialty operations. The area frameincludes small operations not on the list frame, ornonoverlap (NOL) (NASS, 1991).

Nonresponse exists because all sampled farm operatorsdo not respond to the survey. The two types ofnonrespondents are refusals (the farm operator declinesthe interview) and inaccessibles (the farm operatorcannot be contacted). Kalton and Maligalig (1991)note,

"When total nonresponse occurs, the surveyanalysis may simply be carried out on the dataprovided by the responding elements.However, since responding and nonrespondingelements may differ systematically in theirsurvey characteristics, there is a risk with this.approach that the survey estimators wiIl bebiased. It is therefore a common practice toattempt to compensate for the missing dataarising from total nonresponse by some formof weighting adjustment".

Previous analysis (Turner, 1992) has indicated thatFCRS direct estimates at the U. S. level for five majorvariables over the years 1987-1990 are biaseddownward as follows: three major expense items are

43

biased downward about 10%, while land in farms andnumber of farms are biased downward about 20 %. Aninappropriak nonresponse adjustment for the list frameportion of the multiple frame (MF) estimate andundercoverage of farms are major causes of this bias.The 1990 FCRS nonresponse adjustment procedureswill be referred to as the current procedure. Currently,FCRS data are collected under the followingassumption.

Assumption a: All noorespondents wouldqualify for an interview and would have somepOSItive responses to the survey (i.e., areposi.tive records).

In the supervising and editing manual, field enumeratorsare instructed to code all out of business (zero) records,who would not quali fy for an interview, as respondents.These instmctions are intended to ensure that allnonrespondents would qualify for an interview, i.e.,have an agricultural operation. Since all interviews areface to face, it is possible to determine if a record is inbusiness or not. The underlying assumption of thecurrent list frame non response adjustment factor, whichassumes nonrespondents are similar to all respondents,conflicts with Assumption a because the adjustmentassumes noorespondents can include positive and zerorecords. TIle current area frame nonoverlap (NOL)nonresponse adjustment factor, which is applied at theState level, assumes nonrespondents are all positiverecords and is consistent with Assumption a.

Objectives 1 and 2 of this study involved the applicationof a simple adjustment (which is consistent withAssumption a) to list frame sample records using thefollowing weighting classes: 1) the design strata, and2) type/size cells over strata. Objective 3 examined theeffect of applying the adjustment at a type/size cell levelto area frame NOL records. Weighting classes or cellsbased on farm type and economic size are intended toprovide more homogeneity within weighting classes andheterogeneity across weighting classes than the currentclasses (strata for the list and States for the area NOL)provide. If the weighting classes are effective incapturing this homogeneity within and heterogeneityacross classes with respect to response probabilities,they will help reduce nooresponse bias. Previously

Page 48: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

reported control data were used to place nonrespondentsinto appropriate type/size cells.

questionnaires corresponding to the sampled name areclassified as refusals and inaccessibles.

The current list frame expansion factor isEXPANSION FACTORS

Assumption c: the r*(h) respondent samplingunits in stratum h are a simple random samplefrom the n(h) sampled units.

Assumption b: the n(h) sampled units instratum h are a simple random sample ofsampling units from the N(h) population unitsin the stratum.

This assumption is clearly true. Since all reportingunits do not respond, the original expansion factor ismultiplied by an adjustment factor to account for thenonrespondent reporting units. The second term ofEquation (1) is based on the following assumption.

The FCRS sununary currently has two methods foradjusting the list and area frames for non response dueto refusals and inaccessibles. Both procedures aredescribed below. Each sampled unit is initiallyassigned an original expansion factor that would beapplicable if there were no nonresponse, that is, if ausable report was obtained from each reporting unit.For both the area and list frames, the original expansionfactor is the first term of Equation (1). Thecorresponding assumption of this term is the following.

(1)N(h) n(h)

EF = -- '" ---n(h) r *(h)

The area frame sampling unit is a segment of land,usually about one square mile in area, within a land usestratum. Area frame reporting units are residents of thesampled segments who reported agricultural activity onthe previous June Agricultural Survey (JAS), and whoare NOL with respect to the FCRS list. The list framesampling unit is a name on the list sampling frame(LSF). The reporting units are all operatingarrangements associated with the sampled names. Inthe following notation, let h denote a sampling stratum;c denote a type/size weighting cell within a State; ands denote a State.

Furthermore, lett = h, c, or s as appropriate,N(t) = number of sampling units 10 the populationdenoted by t,n(t) = number of sampling units sampled from thepopulation denoted by t,g(t) = number of positive respondent reporting units int,f(t) = number of zero respondent reporting units in t,r(t) = g(t) + f(t) = number of respondent reportingunits in t,e(t) = number of positive non respondent reporting unitsin t,j(t) = number of zero nonrespondent reporting units int, andm(t) = e(t) + j(t) = number of nonrespondentreporting units in t.

Finally, letr*(t) = the number of respondent sampling units in t,andm*(t) = the number of non respondent sampling units int.

If Assumption c were true, then the m*(h)nonrespondent sampling units would also be a simplerandom sample of the n(h) sampled units in stratum h.This contradicts Assumption a, where allnonrespondents are assumed to be positive.

For a sampling unit of the area frame to be classified asnonrespondent, the interviews of all qualifying residents'in a land segment must be coded as refusals andinaccessibles. For the list frame, there is usually onereporting unit per sampling unit. If the reporting unitrefuses or is inaccessible, then it is a nonrespondentsampling unit. When there is more than one reportingunit associated with a list frame sampling unit, theseoperating arrangements are referred to as multipleoperations. A nonrespondent sampling unit exists in thecase of multiple operations when all of the

The following expansion factor is designed to beconsistent with Assumption a and meet Objectives I, 2,and 3 of this study. The level at which the nonresponseadjustment is calculated, which is represented by x,varies and will be described below.

N(h) g(x)+ e(x)EF = -- * (2)

n(h) g(x)

44

Page 49: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

The modified list frame expansion factors for Objectives1 and 2 each have the form of Equation (2) where thenonresponse adjustment factor (term two) is calculatedat the stratum level (x = h) for Objective 1 and atthe type/size cell level (x = c) for Objective 2. Thesenonresponse adjustment factors are consistent withFCRS Assumption a (i.e. all nonrespondents arepositives) since they are based entirely on positiverecords. The nonresponse adjustment factors ofObjectives (1) and (2) are based on the followingassumption.

Assumption d: the positive respondentreporting units {g(h), g(c)} are a simplerandom sample from the positive reportingunits in the stratum or weighting cell.

The current area frame expansion factor has the sameform of Equation (2) where the nonresponse adjustmentfactor (term two) is calculated at the State level (x =s). The nonresponse adjustment factor (term two) ofEquation (2) is applied at the type/size cen level (x =c) for Objective 3. The nonresponse adjustmentfactors of the current area frame expansion factor andObjective 3 are both based on Assumption d above.

DATA DESCRIPTION

For this project, 1990 FCRS data were used. Thevariables that were examined in the analysis are: totalexpenses, livestock expenses, labor expenses, land infarms, and number of farms. Nine States (Arizona,Colorado, Georgia, Illinois, Kansas, Montana, NewYork, North Carolina, and Wisconsin) could not beincluded for the list frame type/size cell analysisbecause the control data for size were missing. Controldata for list records were obtained from the listsampling frame. For area NOL records, controlinformation was collected on the previous JuneAgricultural Survey. Type categories were collapsedinto two classes: crops and livestock. The followingfive size cells were chosen with respect to annual totalgross value of sales: I} 1 to 9,999, 2} 10,000 to39,999, 3} 40,000 to 99,999, 4} 100,000 to 249,999, .and 5} 250,000 plus. Since variance inflation can resultwhen adjustment factors are not based upon adequatesample sizes, a goal of at least 20 positive respondentrecords with control data per weighting class was set.(Cox, 1991). To ensure uniform collapsing of cells, apriority scheme and logic flowchart were followed.

45

RESULTSObjective 1

Expansions and CV's were obtained for the fivevariables using the current list frame nonresponseadjustment factor. term two of Equation (1), applied toeach of the 281 strata in the 39 States. The modifiednonresponse adjustment factor, term two of Equation(2), was applied to each of the 281 strata in the 39States for Objective I. The modified nonresponseadjustment factor by stratum produced expansionsapproximatel y 9 % to 10% higher than the currentexpansions. Four of the CV's are slightly greater thanthose of the current method and one CV is the same.

Objective 2

Term two of Equation (2) was applied by type/size cellwithin State to evaluate Objective 2 for list frameestimates. A total of 212 cells were used over the 39States. These estimates are 10 % to 17 % greater thanthe current estimates. The CV's tend to be slightlylarger than those for the unadjusted expansions or foradjusted expansions at the stratum level.

Objective 3

To evaluate Objective 3, records were assigned to areaframe NOL type/size cells within State using the samelogic used for the list frame records. A total of 68 cellswere used for Objective 3. The estimates of Objective3 are very near the current NOL estimates. Three ofthe CV's are less than those of the current method andtwo are greater. Since the percentage change in theestimates is small for Objective 3. these results indicatethat application of the non response adjustment for thearea frame NOL within cells has negligible effect.

M1JLTIPLE FRAME RESULTS

List and area NOL results have been consideredseparately. Multiple frame results show the effect ofthe list and area NOL results together. AgriculturalStatistics Board numbers, which are considered to betruth, exist for number of farms and land in farms. Forthe three expense items, "Pseudo Board" values(Turner, 1992) were calculated that adjust somewhat forthe FCRS undercoverage of farms. The Pseudo Boardvalues represent a minimum value of truth since thereare other factors that also contribute to the downwardbias. Nonresponse adjusted MF estimates werecalculated at the 48 State level using type/size cells

Page 50: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

within State for both the list and area NOL indications.The list data for the nine States with unknown sizecontrol data were expanded by stratum using themodified non response adjustment factor, since type/sizecells could not be created. The probable effect of usingthe modified nonresponse adjustment by stratum forthese nine States on the 48 State MF indications, insteadof using the modified nonresponse adjustment bytype/size cell within each State, is to bring theindications downward. These nonresponse adjusted MFestimates as well as the current MF estimatesarecompared to the Board and Pseudo Board estimatesin Table 1. The nonresponse adjusted MF estimates forthe expense items closely match their Pseudo Boardvalues, ranging from 3.7% below to 1.3% above. Landin farms adjusted for non response is still biaseddownward by about 13%. This bias is probably due in

part to the tendency of farm operators to underreporttotal farm acreage (McClung, 1988). However, thisbias is about 8 percentage points smaller than thecurrent bias of 21 %. This reduction in bias,represented by the last column in Table 1, for land infarms is comparable to the reduction for the expenseitems, indicating about an 8 to 11 percentage pointeffect on the MF estimates for these items. Oneimportant characteristic of these four items is thatapproximately 23 % of the MF estimates are from thearea frame NOL. The reduction in bias for number offarms is only about 4 percentage points, butapproximately 58 % of the MF estimate is from the areaframe NOL. Since the nonresponse adjustment hadnegligible effect on the area frame NOL, the biasreduction for the MF estimate is also small.

Table 1: 1990 Current MF Estimates and Nonresponse Adjusted MF Estimates Using Type/Size Cells Within Stateat 48 State Level Compared to 1990 Board and Pseudo Board Estimates.

Item 1990 Board & Current Nonresponse Nonrs. Adjstd. (%Pseudo Board MF Adjusted Type/Size of Board) - Current

Estimates (mil.) (% of Cells MF (% of MFBoard) Board) (% of Board)

Total Expenses . 150,269 87.9% 96.3% 8.4%

Livestock Expenses 16,864 88.9% 97.1 % 8.2%

Labor Expenses 14,828 90.1 % 101.3 % 11.2%

Land in Farms 985 78.8% 86.6% 7.8%

No. of Farms 2.1352 82.1 % 85.8% 3.7%

46

Page 51: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

CONCLUSIONS

Results indicated that the largest bias reduction for thelist frame portion of the estimate occurred usingtype/size cells over strata. Evidently, these cells do amore effective job of grouping homogeneous recordstogether than the current design strata. There was littleeffect, however, from using type/size cells for areaframe NOL records primarily because cells could onlybe created in 17 of the 48 States because of the goal ofat least 20 records per cell. A major factor to theremaining downward bias on all five items is theundercoverage of farms by FCRS. The CV's of thenonresponse adjusted estimates increased slightly ascompared to the current CV·s. This probably reflectsmore the failure of the variance approximationprocedure than the nonresponse adjustment procedures.

RECOMMENDA nONS

Analysis of 1990 data indicated the adjustment shouldbe made using type/size weighting classes within eachState for the list frame records. The recommendationfor Objective 3 was optional, since the impact oftype/size cells within State was negligible on the areaside. It was recommended that analysis be conductedon the 1991 FCRS data to determine if type/sizeweighting classes within each State were needed, or ifthe list frame strata were adequate weighting classes.The 1992 FCRS 'used the modified nonresponseadjustment at the design stratum level, since 1991 listframe stratification changes were expected to betteraccount for type and size of farm and since the creationof type/size cells would have added complexity to thesummary process. Since the nonresponse adjustment isbased on the assumption that all nonrespondents haveoperating farms, survey training materials andinstructions should continue to emphasize that refusaland inaccessible sampling units must be farm operators.

REFERENCES

(1) Cox, Brenda G. (1993), "Weighting ClassAdjustments for Nonresponse in Integrated Surveys:Framework for Hog Estimation." SRB Research ReportNumber SRB-93-03, National Agricultural StatisticsService.

(2) Kalton, Graham and Dalisay Maligalig (1991), "AComparison Of Methods Of Weighting Adjustment ForNonresponse, " 1991 Annual Research Conference

47

Proceedines, Bureau of the Census, pp. 409-428.

(3) McClung, Gretchen (1988), "A CommodityWeighted Estimator," Staff Report No. SRB-88-02,National Agricultural Statistics Service.

(4) National Agricultural Statistics Service (1989),"Interviewer's Manual 1989 Farm Costs & ReturnsSurvey (FCRS), " Author.

(5) National Agricultural Statistics Service (1991)," 1990 Farm Costs and Returns Supervising and EditingManual," Author.

(6) Turner, Kay (1992), "Modification of FCRSNonresponse Adjustment Procedures," SRB ResearchReport Number SRB-92-08, National AgriculturalStatistics Service.

ACKNOWLEDGMENTS

The author appreciates the significant contributions oftime and effort by Bill Iwig and Dave Dillard. Thanksto Fatu Wesley for her valued support and advice, toJim Burt for variance analysis, to Fred Warren andMartin Ozga for downloading the data, to Dick Clarkand Dan Le.dbury for supplying FCRS and Boardnumbers, and to Phil Kott for his valuable review of theoriginal NASS staff report (Turner, 1992). Sincerethanks to everyone who provided assistance on thisproject.

Page 52: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

SELECTED RESULTS OF THE INCENTIVE EXPERIMENTON THE 1992 FARM COSTS AND RETURNS SURVEY

Diane K. Willimack, National Agricultural Statistics Service, U.S. Department of AgricultureSurvey Management Division, Room 4151, South Building, Washington, D.C. 20250-2000

ABSTRACT: A split ballot experiment was conductedon the 1992 Farm Costs and Returns Survey (FCRS) infour States to test the effects of a prepaid nonmonetaryincentive on response rates and related variables.Results showed a statistically significant improvementin response rates of 5.4 percentage points due to theincentive. The incentive appeared to be the mosteffective among farms stratified in the smallest andlargest sales classes. In addition, the incentive appearsto have enhanced identification of non-eligible sampleunits (non-farms) over the No Incentive group, reducinga potential nonsampling error.

KEY WORDS: Incentive, prepaid, nonmonetary, splitballot experiment, response rates

INTRODUCTION

The Farm Costs and Returns Survey (FCRS) is anationwide multiple frame survey of U.S. farmoperators, conducted annually by the NationalAgricultural Statistics Service (NASS) of the U.S.Department of Agriculture (USDA). The purpose ofthe FCRS is to gather data for estimating totalexpenditures, net farm income, cost of production forselected agricultural commodities, and other economicindicators of the financial condition of the agriculturesector. In addition, the FCRS provides a data base forprice index construction and microeconomic analyses.

The FCRS is considered by NASS personnel, from fieldenumerators to headquarters staff, to be a challengingsurvey. Face-to-face interviews averaging just over 90minutes in length request detailed expenditure andincome information, which is highly sensitive in nature,from individual farm operators. It is not surprising thatthe FCRS has suffered low U.S. level response ratesthat have declined in recent years from 69 percent to 63percent (Rutz, 1993).

Concern for declining response rates has promptedNASS, like other survey organizations, to search formethods by which potential respondents may beencouraged to participate in its surveys, particularly theFCRS. In recent years, extensive broad-based publicrelations materials have been disseminated in the

popular farm press, and special refusal conversionefforts have been attempted. In addition, FCRSparticipants are routinely offered an Individual FarmFinancial Analysis, which summarizes a number ofeconomic characteristics of each respondent's own farmin comparison with farms of similar type and size in thesame State, based on data from the survey. While noneof these efforts is considered inconsequential, they havebeen met with little or no apparent success at actuallyincreasing response rates.

In order to continue to explore methods for increasingresponse to the FCRS, NASS decided to investigate theuse of incentives, a method often considered in surveyresearch for the purpose of influencing surveyparticipation. In such studies, typically a monetary ornonmonetary incentive is either prepaid unconditionallyor promised for survey completion. In the studyreported in this paper, a split balJot experimental designwas utilized in four States to test the effect of a prepaidnonmonetary incentive on response rates on the 1992FCRS.

LITERATURE REVIEW

Incentives have long been used in mail surveys,particularly of the general population, in order toincrease response rates. A wide body of publishedliterature suggests that monetary incentives are moresuccessful at eliciting response than are nonmonetarygifts, and that prepaid incentives are more effective thanrewards promised upon return of completed surveyinstruments. See for example qualitative review articlesby Armstrong (1975) or Linsky (1975), or quantitativemeta-analyses by Church (1993), Heberlein andBaumgartner (1978), or Yu and Cooper (1983). Inparticular, Church (1993) found promised incentives ofany kind, monetary or nonmonetary, to be ineffective,asserting that "[t]hese types of incentive plans aresimply not worth the energy involved" (p.75).

A recent study reports the results of an incentiveexperiment in an establishment survey, a mail survey ofsmall construction subcontractors asking for detailedinformation about employees' health insurance coverage(James and Bolstein, 1992). Tested for their effect on

48

Page 53: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

response rates were prepaid monetary incentive amountsincreasing from $1 to $40, along with a promisedmonetary incentive of $50 contingent upon surveyreturn. Each of these was coupled with up to fourfollow-up mailings. The promise of $50 performed nobetter than the control condition of no incentive withmultiple mailings. Although prepayment of $20maximized survey response, it was not considered costeffective, and the authors concluded that token amountssuch as $1 or $5 were sufficient to influence surveyparticipation.

James and Bolstein attribute their posIhve incentiveeffects to social exchange theory: "By giving moneythe researcher extends a token of trust to the surveyparticipant and initiates a social exchange relationshipwhich invokes a social obligation for the participant toreciprocate in kind" (p.451). Social exchange theoryand the principle of reciprocation are typically invokedto explain the positive effects of incentive use insurveys of households and among the general population(Dillman, 1978; Groves, Cialdini, and Couper, 1992).Applying this reasoning to estabhshment surveys is anontrivial matter because it is unclear who in anestablishment we are attempting to influence with theuse of an incentive. Incentives are meant to influencethe survey participation decision and to motivate therespondent. However, in an establishment survey, thedecision-maker may not be the actual respondent, themost knowledgeable provider of the information beingsought. This is typically the person who has access toand understanding of any records to be used as a sourcefor responding (Edwards and Cantor, 1991).

Farmers, on the other hand, are likely to be both thedecision-maker regarding survey participation and themost knowledgeable provider of the information soughton the FCRS. Hired accountants or record-keepingservices were used routinely for keeping farm financialrecords by less than half of the farms with sales of$250,000 or more in 1987. Moreover, farms of thissize account for only about 5 percent of all U.S. farms.Approximately 11 percent of all farms subscribed torecord-keeping services in 1987. Although theproportion of farmers using outside services forfinancial record-keeping may have increased in recentyears, it is still likely that the majority of farmoperators are the most knowledgeable source of thefinancial information being requested by the FCRS,since evidence suggests that even service users havesome close hands-on method of keeping track offinances, such as a workbook or ledger (Willimack,1989).

49

Furthermore, almost 99 percent of U.S. farms weresale proprietorships, legal partnerships, or family-heldcorporations in 1990. These operations are, bydefinition, family farms and are attached to farmoperator households (Ahearn, Perry, and El-Osta,1993). Indeed, "family farms meet eitherdefinition," households or establishments (Edwards andCantor, 1991). Farmers may call upon a decision rulenot unlike that of a householder regarding a request forsurvey participation. Thus using incentives to influencefarm operators takes advantage of their characteristicsas 1) the person most knowledgeable of the informationrequested, 2} the decision-maker regarding surveyparticipation, and 3) a householder.

DESIGN AND METHODOLOGY

The FCRS utilizes a multiple frame design, consistingof a list frame and a complementary area frame. Thelist frame accounts primarily for medium, large, andspecialty famls. For the FCRS, the operations on thelist are stratified by type and size, including only thosewith "estimatt>A.Ivalue of sales" of $20,000 or more.The area frame, which is stratified geographically andby land use, covers farms that are missing from the list,called nonoverlap (NOl). NOl farms are typically,though not ex~Jusively, small farms. The U.S. list andarea NOl sample size for the 1992 FCRS was 21,000.

FCRS data are collected using several questionnaireversions in face-to-face interviews averaging 90 minutesin length. Data for the previous calendar year arecollected during February and March of the currentyear, which coincides with the end of the traditional taxpreparatIOn period for U.S. farms. Any operationhaving agriculture qualifies for an interview, althoughit may not qualify as a farm. The official USDAdefinition of a farm is any establishment which sold orwould normally have sold at least $1,000 of agriculturalproducts during the previous year.

The 1992 FCRS Incentive Experiment utilized a "splitballot" expenmental design in four States -- Georgia,Idaho, Kansas, and Michigan -- representative ofhistorically luw or dedining response rates, as well asagricultural and geographic diversity. Sample elementswere randomly assigned to the two incentive treatmentgroups by version and by stratum, resulting in 1181sample units In the Incentive Group and 1183 sampleunits in the No Incentive Group.

The nonmonetary incentive item was a vinyl pocketportfolio containing a notepad and a removable solar

Page 54: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

The response rates were 63.3 percent in the IncentiveGroup and 57.9 percent in the No Incentive Group,representing a statistically significant increase inresponse rates due to the incentive of 5.4 percentagepoints (two-sided p-value=0.009). The reduction ofnearly 5 percentage points in refusal rates, from 29.9percent in the No Incentive Group to 25.0 percent inthe Incentive Group, is also statistically significant(two-sided p=O.Oll). See Table 1.

No Two-tailIncentive Incentive p-value 'l:.1

Response Rate 63.3% 57.9% 0.009(n) (1072) (1102)

Refusal Rate 25.0 29.9 0.011(n) (1072) (1102)

Eligibility Rate 90.8 93.2 0.034(n) (1181) (1183 )

11 Multiple frame, unweighted, ratio estimates.'£:.1 Ho: Rate(Incentive) = Rate(No Incentive).

Furthermore, the eligibility rate of 90.8 percent in theIncentive Group is significantly lower than thecomparable rate of 93.2 percent in the No IncentiveGroup (p=0.034). That is, sample units that had noagriculture, and thus were not eligible for the FCRS,were more likely to be identified among incentiverecipients than among nonrecipients. The lack ofeligibiity emanates primarily from the list frame. Thefarm status of individual list records has likely not beendetermined as recently as operations sampled from thearea frame, which have been contacted at least once inthe past year.

conducted for this paper, only the unweighted resultswill be reported in detail, as this appears to be morecommon in the survey literature. (By our definition,"unweighted" analysis completely ignores the sampledesign, while a ·weighted" analysis incorporates thedesign.) Reference will be made to weighted analysisin summary, and exceptions will be noted. In all cases,a one-tail test of significance is of primary interest.The decision criteria will be relative to a one-tailconfidence level of alpha=O.lO, since the widevariability of sampling weights in the NASS stratifiedsample design is not particularly beneficial to theestimation of proportions. Nevertheless, two-tail p-values will always be reported, allowing the reader tomake hislher own judgments.

Response Rates, Refusal Rates, andEligibility Rates, by Incentive Group,1992 FCRS Incentive Experiment 11

Table 1:

Eligibility Rate = Number Eligible for FCRSNumber of Sample Units

calculator. It bore an imprint identifying the survey,along with the State's and the Agency's identification.It was token in nature and not intended to representcompensation for the respondent's time and surveyparticipation. The item cost $7.06 per unit to produce.

The incentive item was prepaid; its receipt was notcontingent upon survey participation. It was enclosedalong with the pre-survey notification letter mailed tosample units prior to interviewer contact. Letterscontaining the incentive item acknowledged itsenclosure, saying, "While this token in no way equalsthe value of your contribution to this survey, itrepresents our appreciation for your consideration."Letters to sample units in the No Incentive Group didnot include the incentive item, nor did they mention it.Other than the above additional statement, letters inboth groups were identical, written and signed by theState Statistician of each participating State. All pre-survey letters were addressed to the farm operator oroperation by name and were mailed 1-2 weeks prior tothe beginning of FCRS data collection in each State.

Response Rate = Number of Completed InterviewsNumber Eligible for FCRS

Refusal Rate = Number of RefusalsNumber Eligible for FCRS

On the FCRS, the various sample disposition rates havethe following definitions:

Survey interviewers were informed of the identity ofincentive recipients and nonrecipients by way of aspecial code on each questionnaire label. They wereallowed to use this knowledge in an appropriate mannerupon contact with incentive recipients only. They werenot allowed to personally carry or possess the incentiveitem, nor were they allowed to promise it as a gift forsurvey completion among nonrecipients.

Although both unweighted and weighted analyses were

RESULTS

NASS traditionally reports response rates in terms ofsample counts, with each sample unit being given equalweight. That is, these calculations do not take accountof the sample design, and sampling weights are notapplied.

50

Page 55: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

1/ Multiple ::rame, unweighted, ratio estimates.'1:./ Ho: Rate(lncentive) = Rate(No Incentive).

these reversals become negligib]e as well asnonsignificant when the sample design is considered andappropriak sampling weights are applied.

Response Rates and Refusal Rates, bySiratified Farm Value of Sales Class,byIncentive Group, 1992 FCRS IncentiveExperiment 11

Table 2: State-level Response Rates and RefusalRates, by Incentive Group, 1992 FCRSIncentive Experiment 11

No Two-tailState Incentive Incentive p-value '1:./

GeorgiaResponse Rate 73.5% 66.0% 0.070Refusal Rate 18.5 23.0 0.227(n) (238) (244)

IdahoResponse Rate 75.3 67.9 0.061Refusal Rate 19.2 25.7 0.078(n) (255) 1265)

KansasResponse Rate 44.0 41.2 0.472Refusal Rate 37.2 42.4 0.183(n) (309) (311)

MichiganResponse Rate 64.8 59.9 0.235Refusal Rate 22.2 25.9 0.314(n) (270) (282)

1/ Multiple frame, unweighted, ratio estimates.'1:./ Ho: Rate(lncentive) = Rate(No Incentive).

As can be seen in Table 2, the incentive appears tohave increased response rates and reduced refusal ratesin each of the four States. Even though the experimentwas not specifically designed with sufficient samplesizes to support State level inferences, the response ratedifferences in Georgia and Idaho were large enough toreach significance. In addition, the refusal rate issignificantly reduced (according to a one-tail test witha]pha=0.10) by the incentive in Kansas, a State that hashistorically found FCRS survey participation to beparticularly troublesome.

Table 3:

Farm Valueof Sales

less than $20.000Response RateRefusal Rate(n)

$20,000-$39,999Respons\: Rat\:Refusal Rate(n)

$40,000-$99,999Response RateRefusal Rate(n)

$100,000-$249,999Respons\: RateRefusal Rate(n)

$250.000-$499,999Response RateRefusal Rate(n)

$500,000 or moreResponse RateRefusal Rate(n)

NoIncentive Incentive

81.3% 64.3%8.2 22.1

(134) (140)

61.6 68.024.4 19.4(86) (103)

69.4 62.221.7 29.4(180) (177)

59.7 57.429.3 30.9(273) (256)

56.0 61.835.2 27.4(125) (157)

57.7 46.126.6 38.7(274) (269)

Two-tailp-value

'l:.1

0.0010.001

0.3640.409

0.1450.094

0.5930.696

0.3270.160

0.0070.003

Table 3 shows the distribution of Incentive and NoIncentive response rates and refusal rates by ffestimatedfarm value of sales,· the variable used explicitly forstratification purposes in the list frame and forclassification purposes in the area frame. Here aninteresting phenomenon appears. Increases in responserates due to the incentive are highly significant in thesmallest size class and in the largest size class only,although small mid-size farms in the $40,000-$99,999sales class appear to have been significantly influencedby the incentive as well. Two size classes, $20,000-$39,999 and $250,000-$499,999, exhibit nonsignificantreversals, where the Incentive Group suffered lowerresponse rates than the No Incentive Group. However,

51

Estimation tak ing account of the sample design enablesinference of these experimental findings to thepopulation of U.S. farm operators. Although the ]evelestimates change when sample weights are applied, thedirection of the differences and their statisticalsignificance are retained in all cases reported above,except in the largest sales class. Moreover, whenanalysis is limited to list frame sample elements, thislargest size class exhibits a highly significant increasein response rates of more than 12 percentage points(p=O.006), due to the incentive. It appears that theloss of significance in the multiple frame estimate is dueto the influence of an area frame refusal or inaccessiblesample element with a large sample weight.

Page 56: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

In order to further understand the effects of theincentive, additional analysis was undertaken using thelist frame sample only. The greatest samplingefficiency is gained in the list frame since it accountsfor the largest farms, which contribute the most to theFCRS estimates of expenditures and income. Thus,response from these list frame sample elements iscrucial to data quality. However, it is well documentedthat FCRS response rates decline as farm size increases(Rutz, 1993). Therefore, it is important to evaluate theeffect of the incentive in the list frame.

A regression was performed incorporating the sampledesign using data from list frame records only.Response behavior was regressed on incentive receipt,the stratification variable "estimated farm value ofsales," as well as a term representing their interaction,along with variables that controlled for State andquestionnaire version effects. While the incentivevariable by itself proved nonsignificant, its interactionwith the size variable significantly increased theprobability of response of a list sample unit. Theselected estimated equation is:

(probability of Response) * 100% = Intercept

- 3.29 In(Size) + 0.56 [In(Size) * Incentive](p=O.067) (p=O.091)

+ State variables

+ Questionnaire Version variables,

where Size = Estimated farm value of sales

and Incentive = 1 if incentive,o if no incentive.

Incentive receipt appears to have significantly offset thedecline in the probability of response among largerfarms.

CONCLUSIONS

Clearly the inclusion of a nonmonetary incentive withthe pre-survey notification letter resulted in asignificantly higher response rate on the 1992 FCRS inthe four States included in the experiment. Theincrease in response rates appears primarily due to areduction in refusals among those who received theincentive. In addition, the incentive appears to haveenhanced identification of non-eligible sample units

52

(non-farms) over the No Incentive control group,reducing a potential nonsampling error. Incentiverecipients who had no agriculture may have been moreattentive to the survey request and more determined tonotify the interviewer of their non-farm status, ratherthan to become refusals or inaccessible sample units.

Furthermore, the incentive appears to have had thelargest and most significant effects on response ratesamong the smallest farms from the NOL area framesample and among the operations sampled from thelargest farm strata on the list frame. These two groupsof farms have in common high likelihood of repeatedcontact for NASS surveys. Area frame NOL farmsremain in the NASS sample for five years. Thoseselected for the FCRS are in their fourth and fifth yearsof their sample rotation and have likely been calledupon repeatedly for a variety of NASS surveys, for itis their role to account for the incompleteness of thelist. The largest list frame operations, likewise, arefrequently subject to repeated contact for NASSsurveys. As the largest operations, they account for asubstantial portion of agricultural production. Inaddition, their higher degree of variability results Intheir being sampled at a higher rate.

The incentive, as a token item, cannot havecompensated respondents, especially the large farms,for their FCRS participation. Thus these results cannotbe interpreted using an argument based on economicexchange. Instead, the effects of the incentive must beinterpreted in a social context. The incentive may have1) drawn attention to the pre-survey letter, 2)legitimized the survey request, 3) identified anddifferentiated the survey sponsor (this is especiallyrelevant since the U.S. Census of Agriculture wasconducted just prior to the 1992 FCRS), 4) notified thefarmer of the impending visit by an interviewer, and 5)offered a sign of appreciation, enabling the trustnecessary for social exchange to occur.

Indeed, the latter explanation drawing upon socialexchange theory, may be most appropriate, given thatthe two groups most affected by the incentive, thesmallest NOL farms and the largest list frame farms,are those most frequently requested to participate inNASS surveys. It is likely that a relationship, rapport,has been built between these farm operators and theinterviewers. Unconditional receipt of the incentiveitem may have been perceived by these FCRSrespondents as a token of appreciation consistent withthe ongoing social relationship of survey contact. Thismay have been particularly meaningful among thesmallest farms, which are noncommercial in nature, and

Page 57: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

their operators may not even consider themselves to befarmers. The incentive may have symbolized the trustthat, according to Dillman (1978), is necessary forsocial exchange to successfully occur, invoking socialnorms in the respondent consistent with surveyparticipation. Thus, like James and Bolstein (1993), itseems reasonable to attribute positive incentive effectsin the FCRS, an establishment survey of farmers, tosocial exchange theory.

ACKNOWLEDGMENT

The author wishes to acknowledge the technicalassistance of several staff members of the SurveyResearch Branch of NASS, particularly Jim Burt andPhil Kott, who advised on analysis of the data takingaccount of the sample design.

REFERENCES

Ahearn, M. C., J. E. Perry, and H. EI-Osta (1993)"Economic well-being of farm operator households,1988-1990." AER-666, Economic Research Service,U.S. Department of Agriculture, Washington, D.C.

Armstrong, J. S. (1975) "Monetary incentives in mailsurveys." Public Opinion Quarterly 39:223-250.

Church, A. H. (1993) "Estimating the effects ofincentives on mail survey response rates: A meta-analysis." Public Opinion Quarterly 57:62-79.

Dillman, D. A. (1978) Mail and Telephone Surveys:The Total Design Method. New York: John Wileyand Sons.

Edwards, W. S., and D. Cantor (1991) "Toward aresponse model in establishment surveys," Ch. 12 inMeasurement Errors in Surveys, edited by P. P.Beimer, R. M. Groves, L. E. Lyberg, N. A.Mathiowetz, and S. Sudman. New York: John Wileyand Sons, Inc., pp. 211-233.

Groves, R. M., R. B. Cialdini, and M. P. Couper(1992) "Understanding the decision to participate in asurvey." Public Opinion Quarterly, 56:475-495.

James, J. M., and R. Bolstein (1992) "Large monetaryincentives and their effect on mail survey responserates." Public Opinion Quarterly 56:442-453.

53

Linsky, A. S. (1975) "Stimulating responses to mailedquestionnaires: A review." Public Opinion Quarterly39:82-101.

Rutz, J. L. (1993) "Farm Costs and Returns Survey for1991 and 1992: Survey Administration Analysis."Staff Report #SAB-93-01, National AgriculturalStatistics Sen-ice, U.S. Department of Agriculture,Washington, D.C.

Willimack, D. K. (1989) "The financial record-keepingpractices of U. S. farm operators and their relationshipto selected operator characteristics." Paper presentedat the Annual Meetings of the American AgriculturalEconomics Association, Knoxville, Tennessee, August.

Yu, J., and H. Cooper (1983) "A quantitative review ofresearch design effects on response rates toquestionnaires." Journal of Marketing Research 20:36-44.

Page 58: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

TIME RELATED COVERAGE ERRORS AND THE DATA ADJUSTMENT FACTOR (DAF)

Jeffrey T. Bailey, USDAINASSResearch Division/32S1 Old Lee Hwy., Fairfax, VA 22030

ABSTRACT

The National Agricultural Statistics Service (NASS)conducts quarterly surveys to estimate crop acreage,grain stocks and hog inventories. Sample replicatesfrom the stratified sample design are surveyed on arotating basis to allow for quarter to quarter overlapwhile bringing other operations into the survey. Withthis design, farming operations may be enumeratedfrom one to four quarters in a particular year's surveycycle.

Operations are sometimes reported as ·out-of-business· in one of the quarterly surveys when theywere in business during a previous quarter. While thisis not a problem if the questionnaires are correctlycoded, a review of survey data reveals a significantnumber of coding errors in one quarter or the other.This between-quarter discrepancy in an operation'sbusiness status can change the coverage of thepopulation (particularly if the change is due to incorrectcoding) and have a major impact on the resultingindications.

This study looked at the effect of the coveragechange on the indications and the reasons forquestionnaires being coded as ·out-of-business·. Fromthis research we hope to determine: 1) the extent towhich those ·out-of-business· changes represent datacollection errors rather than real operation changes, 2)how to reduce the number of operations incorrectlybeing coded ·out-of-business" and, 3) whether the dataare increasing for operations remaining in business tooffset operations legitimately going ·out-of-business·.

SUMMARY

The National Agricultural Statistics Service (NASS)conducts quarterly surveys to estimate crop acreage,grain stocks, and hog inventories. The replicated,stratified sample design results in sampled operationsbeing surveyed in a rotating fashion, allowing for somequarter to quarter overlap while reducing respondentburden. A new sample begins in June with quarterlysurveys in the following months of September,December, and March.

A dilemma arises as the year's survey cycleprogresses beyond the June base survey, because the

54

percentage of ·out-of-business· operations increases.This creates a situation where the indications from thesurvey decrease and the population coverage maybecome incomplete. Observations show thatapproximately 4 to 6 percent of operations change from"in business" one quarter to ·out of business· the next.Reviewing the questionnaires indicates that a substantialnumber of these were inaccurately coded or lackedcomplete information.

The Data Adjustment Factor (DAF) adjusts the datafor duplication and eliminates data that should not besummarized. When an operation is ·out-of-business·the DAF is zero. Calculations of the average DAFshow that it continually decreases the further you getfrom June. The DAF reduced the Decemberexpansions relative to June by about 2 percent in 1991and 1 percent in 1992. This drop from June issubstantial, but how much of it reflects a legitimatechange in the target population? What led to thereduction of the DAF impact in 1992 and how can wefurther reduce its effects?

The DAF should continue to be monitored and effortsbe made to reduce its artificial impact upon the surveyindications. Some suggestions to reduce the DAFdecline are more training, changes in coding oldreplications, and the use of historic data to confirm·out-of-business· operations. These suggestions willlikely not completely eliminate the DAF problem andmore ideas should be developed and studied to lessenand monitor the DAF impact.

INTRODUCTION

The National Agricultural Statistics Service (NASS)conducts many surveys to estimate inventory andproduction of various agricultural commodities. As apart of its Agricultural Survey Program, NASSconducts quarterly surveys to estimate crop acreage,grain stocks and hog inventories. Analysis ofDecember 1991 Agricultural Survey data showed thatthe December crop indications for planted acres werealways lower than the June indications. Within agrowing season the reported planted acreage of a cropshould not change, unless intentions were reported inJune and the crop was never actually planted. It was

Page 59: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

discovered that many operations which reported cropsin June were now "out-of-business" in December.

Reviewing the data of these "out-of-business"operations focused attention on the Data AdjustmentFactor (DAF). The DAF is used to adjust forduplication and to eliminate any data reported on an"out-of-business" operation. The value of the DAF isalways inclusively between zero and one. The averageDAF was calculated for successive quarterly surveysand found to decline as time passed. Several reasonscan account for this and many ideas have beenexpressed.

This paper will begin with a description of themultiple frame surveys at NASS and how coverageerrors can occur as time passes. Then the analysis ofthe DAF will be presented.

NASS MULTIPLE FRAME SURVEYS

NASS conducts many surveys and for each It IS

necessary to define the sampling population or frame ofunits to sample. For most NASS surveys the targetpopulation is all operations with the agriculturalcommodities of interest. NASS maintains a list frameof names thought to be farm operators in each state forits sampling. Considerable time and resources arespent in the state offices updating and maintaining theselists. In addition to the samples drawn from these lists,samples are drawn from an area frame of all land in theU. S. from which estimates are generated to measurelist incompleteness. Together the two frames form amultiple frame survey design which NASS uses inmany of its surveys.

This study focuses on NASS's quarterly multipleframe Agricultural Surveys. The list sample is selectedin the spring with the surveys conducted during June,September, December and March. During the basesurvey in June a complete area sample is enumerated.For this survey, every operation in the U. S. has achance to be sampled either from the list and areaframe or the area frame alone. Names found in thearea frame during June that are not on the list frame(NOL) will be used in subsequent quarters to representthose operations which had no chance of list frameselection.

The list sample consists of several replications whichare selected each spring for use during the course of thesurvey year. These replications are rotated in and outfrom survey to survey to provide quarter to quartercomparability and to relieve respondent burden. Withthe rotation scheme used, farming operations may be

55

enumerated from one to four quarters in a particularyear's survey cycle.

TIME RELATED COVERAGE ERRORS

The samples for the Agricultural Survey are selectedin the spring of each year. Before some samples aresurveyed they will go "out-of-business". If an "out-of-business" operation is taken over by a new operation,this new operation must have a chance of selection.Any new operations taking over an "out-of-business"operation before June 1, will have a chance of inclusionin the area frame sample during the June AgriculturalSurvey. New operations starting up after June 1 canonly be accounted for by substitution procedures, sincethere is no complete area frame survey done after June.

These substitution procedures provide a means togive everyone a chance of being selected to assurepopulation coverage. Substitutions should be madewhen sampled units are "out-of-business" and the newoperator was not farming on June 1, but there isconcern that the procedures are not always executedproperly and all needed substitution is not being done(Jones 1988). Furthermore, substitution only occurswhen an operation is completely "out-of-business ". Ifan operation sells off only part of its land to a newoperator. that operation is not eligible for substitutionand does not have a chance of selection (Dillard 1993).NASS is currently researching how effectivelysubstitution procedures are being followed and theimpact of the substitution process on survey indications.

For the follow-on quarterly surveys of September,December, and March, about 40 % of the sample isfrom new replicates, with the remaining from oldreplicates that were surveyed in a previous quarter.For old replicate samples only those operations thatwere in business in the previous quarter will besurveyed in a following quarter.

Figure 1 shows the percentage of active samples fromold replications that were coded "out-of-business".While over the course of time it is natural for someoperations to go "out-of-business" , the percentage codedas "out-of-business" is questionably high. It is doubtfulthat all operations so coded actually went "out-of-business" since the earlier quarter contact; some may bemiscoded and others may have been refusals in aprevious quarter.

This study looked at the errors of reporting andcoding "business" status and their effect on coverage.While some operations legitimately go "out-of-business"between quarters, and these can be substituted for, asubstantial number of changes from quarter to quarterare errors in coding. For example, an operation is

Page 60: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

PERCENT OUT OF BUSINESSFrom Active Old Replications

, ..

Figure 2

'0

Dec-.betMonth of SUrvey

Cyde -.,", II1II.112

o

20

-Z6

CHANGES SINCE JUNE (923=1)Out of Business From Active Old Reps

15 .....

December

Month of Survey

Cycle Veer

• t 181 Ii1112

-..7SS

3Z, .-o

Figure 1

coded as ·out-of-business· in a current quarter but ·inbusiness· for a previous quarter, when in fact it shouldhave been recorded as ·out-of-business" during the firstquarter because the sample unit was a landlord. Theconverse can also happen when an operation is coded as"out-of-business" when it is really in business, since itcontinues to have potential for agricultural production.

In addition to being coded as "out-of-business·,questionnaires are coded as to whether the sampledoperation has changed since June 1. When an operationhas gone ·out-of-business" since June 1 item code box923 on the face page of the questionnaire is coded a 1.Figure 2 shows the plrprisingly low percentage of ·out-of-business· operations from active old replicates thatwere coded as a change since June 1. Since all oldreplicates were reported in business during a previousquarter, we would expect nearly all current survey"out-of-business" reports to be changes since June 1.Therefore, if the current survey coding is correct, mostoperations were reported erroneously during theprevious quarter. However, it is believed that code box923 is frequently left uncoded. The coding of this boxmay be overlooked for old replications in part becauseit does not need to be coded for new replications.

Any operation that is reported as "out-of-business" isnot surveyed again during that year's survey cycle. ByNASS's definition, an ·out-of-business· operation does.not have any agricultural commodities and has nopotential for agriculture during the rest of the year.Therefore, if correctly reported, it will have nothing toreport in the following quarters and need not besurveyed. Each quarter more of these known zeros areaccumulated, which creates problems when an operationis misreported as "out-of-business." State StatisticalOffices (SSO) are instructed to review the known zerooperations, but since not all are enumerated again some

previous survey errors may go undetected. Anyundetected misreporting of business status will cause adownward bias in the indications.

ANALYSIS OF THE DATA ADJUSTMENTFACTOR

In NASS's Agricultural Surveys, the DataAdjustment Factor (DAF) adjusts reported data forduplication and eliminates any positive data foroperations that should not be summarized. Undernormal situations the DAF is one, but it can have othervalues between zero and one. Common situationswhere the DAF is not one are: 1) an operation isduplicated in the same stratum (DAF=.5), 2) anoperation is duplicated in a higher stratum (DAF=O),and 3) an operation is ·out-of-business" (DAF=O).Table 1 shows the weighted (by the expansion factor foreach design stratum) average of the DAF during the lasttwo cycles of the Agricultural Surveys. The pattern ofa decline is clear. One would expect to see somedecline as operations go "out-of-business", but theamount of decline is of concern since it can have alarge impact on survey results.

To determine the effect of the DAF on the expandeddata, analysis was done comparing June to Decemberexpansions (Tables 2 & 3). The effects of the DAF,reported data, and the tract/farm weight factors wereseparated to assess the magnitude of each. This wasdone by calculating the normal June expansion, thenusing the information from those reporting in Decemberto recalculate the June expansion. For example, theexpanded data for an operation that was in business inJune but not in December, would be positive in June

56

Page 61: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Table 1. Average Data Adjustment Factor

Cycle Month of SurveyYear I I IJune September December March

1991

1992

.926

.946

.899

.933

.871

.907

.849

.87

Table 2: Data Adjustment Factor (DAF) Effect on the Com Planted Acreage Expansion for Survey Years1991 and 1992.

Factor June to December Comparable Reports for Factor

1991 1992

Ratio June Difference Difference Ratio June Difference Differenceto Dec. June - Dec. as % of US to Dec. June - as % of US

(000) Dec. (000)

DAF .95 -1,554 -2.0 .96 -1,108 -1.4

List Data .99 -192 -0.2 1.00 -63 -0.1

Area Data and .99 -95 -0.1 1.07 464 0.6Weight

Table 3: Data Adjustment Factor (DAF) Effect on the Total Hog Inventory Expansion for Survey Years1991 and 1992.

Factor June to December Comparable Reports for Factor

1991 1992

Ratio June Difference Difference Ratio June Difference Differenceto Dec. June - as % of US to Dec. June - as % of US

Dec. (000) Dec. (000)

DAF .95 -1,271 -2.3 .98 -615 -1.0

Data .97 -685 -1.2 1.03 807 1.4

Area Weight .99 -122 -0.2 .99 -99 -0.2

and zero for the recalculated June expansion with theDecember information. Comparable reports for aparticular factor had to have usable factor informationfrom both the June and December surveys.

57

Additionally, comparable reports for data and weighthad to be in business both quarters. For the complanted acreage expansion the area data and tract/farmweight factors can not be separated. because in June

Page 62: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

37 Previously refusal and status not determined15 Partner reported in higher strata12 Partner reported in same strata11 June with potential only.8 Landlord only: incorrectly reported in previous

quarter7 Turned over to someone else7 Sold farm6 Name on label does not farm5 Reported crops or livestock earlier, and

reported none now

write out the reasons that operations changed theirbusiness status to ·out-of-business". Table 4 IS acompiled list of the reasons from four states. Themost common reason was that incomplete informationwas obtained during the prior survey, because therespondent either refused or did not provide informationabout a partner involved in the operation.

Several of the reasons for operations being coded as'out-of-business" are related to the (small) size ofoperations and to whether they have agriculturalpotential. NASS defines as "out-of-business' anoperation which has no potential for agriculturalinventory or production during the remainder of thesurvey year. With this definition, no operation withpotential for agricultural commodities should be codedas ·out-of-business". While these operations may havenothing to report for any particular quarter they mayhave agricultural inventory or production during asubsequent quarter.

From the Table 4 list we can not tell directly whetherthe change in business status occurred after June 1 orwas simply not picked up during a previous quarter.We can presume that some reasons, like 'landlordonly', reflect situations which were not picked up in aprevious quarter. Others, like 'sold farm', mayor maynot represent actual changes since June 1. If thechange occurred after June I then the selected unitwould be a candidate to be substituted for. If there isnot an actual operation change, then there is a mistakein one quarter or the other. This may result from therespondent failing to answer correctly, some recordingerror, erroneous office coding, or one of many otherpossibilities.

only tract data are reported while in December onlyfarm data are reported. For total hogs, farm data arereported in both June and December, so comparisonsbetween June and December of both data and weightscan be made.

From Tables 2 & 3, we can see that in 1991 theOAF factor had a greater impact upon the difference inexpansions between June and December than did thedata or the weight. For example, the OAF factorresulted in a decrease in the U. S. expansion of 2percent for com planted acreage while tbe list/area dataand weight factors decreased the expansion by only 0.2and 0.1 percent, respectively. The situation for totalhog inventory was similar, with the OAF decreasing thehog expansions by 2.3 percent. The size of thedecrease due to the OAF factor is larger than thecoefficient of variation for both estimates, illustratingthe substantial effect the OAF has.

When we look at the 1992 analysis in Tables 2 & 3,we see that the effect of the OAF is about one half thesize it was in 1991. This is encouraging, but the reasonfor the change in results is hard to determine. It ispossible that training to make people aware of the OAFconcerns has had a positive impact. One possiblereason for the drop is the new list samplingunit/reporting unit association procedures, which half ofthe states used in 1992. These new procedures forassociating reported data with sampled list names arecalled "operator dominant," as compared to theprevious procedures which are referred to as 'operationdominant.' To ~ if this procedural change reducedthe DAF impact, tbe effect of the OAF was comparedbetween the two groups of states. Analysis showedthere is only slight evidence that the DAF effect wassmaller in the group witb the new list dominantprocedures.

To learn why operations were being coded as "out-of-business' we began to collect reasons. Observationsmade in Missouri during June 1992 were used tocompile a preliminary list of these reasons. This listwas used in Kansas during the December 1992Agricultural Survey to code all questionnaires for whichthe reporting unit was coded 'out-of-business' (i.e. itemcode 921 =9). All old replications so coded were inbusiness a previous quarter, while new replicates hadnot been surveyed. The reasons to be used in thecoding were designed to differentiate between thesituations expected between old and new replicatesamples. The resulting list of reasons, while a startingpoint, turned out to be inadequate since too manyreasons were grouped as ·other.·

To improve upon the reason coding, listings weresent to selected states after the December 1992Agricultural Survey. State office personnel were to

58

Table 4:

NumberTimesOccurred

Detail of Reasons for Old ReplicationsCoded as "Out-of-Business"

Reason

Page 63: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

4

4443333

2

2222

221111

Minor crops or a few livestock only inprevious surveyTurned over to sonDeceasedRetiredLand is now idleValid "out-of-business" (reason unknown)Box 921 coded in error in current surveyLand is now rented, operated it previousquarterCRP operator which should not be coded ·out-of-business"Miscoded multiple operationsOperator lied on previous reportFarm operated by someone elsePreviously reported as 2 operations, actuallyonly 1Name correction on area frame, now OLPartner strata boxes coded incorrectlyChicken contractor onlyWorks on another farm onlyWrong name collected on June tractGrain Co. only

DISCUSSION AND RECOMMENDA nONS

year workshops. Statisticians in each state office couldcompile a list of reasons why some of their operationswere coded "out-of-business". This list could then bethe subject of small group discussions, probing forsolutions.

I recommend the coding scheme for the "changesince June box" be modified to improve the accuracy ofits coding. Procedures that would require it be codedfor all "out of business· operations would prevent itfrom being ignored. Once accurate information isobtained, ratios to a previous quarter could excludeillegitimate changes.

Another way to reduce the number of old replicatesamples inappropriately being coded as ·out-of-business· is by using historic data. When a respondentresponds that they do not have the items of interest, wecould then verify that they no longer have the itemsreported previously. This would be especiallybeneficial on CATIICAPI.

I recommend we look more closely at the ·out-of-business" operations and assess whether datacompensation is being realized through the use of thecurrent substitution procedures. This is the thrust of aseparate rest:arch activity currently being addressed inNASS.

There are many causes for the DAF decline. Someof the decrease is valid and expected since operationswill always be go~ "out-of-business" , but some is dueto survey error .. The many causes increase thecomplexity of determining what needs to be done. Theevidence suggests that the DAF decrease is large,meriting further analysis. Education and awareness canreduce errors. Procedural changes in coding todistinguish the difference between reporting errors andvalid changes may provide better indications.Collecting more reasons for operations coded as "out-of-business" may give further insight, while measuringand adjusting for the DAF and the use of ratio estimatesbased on operations whose DAF did not change mayneed to continue.

There already have been efforts to educate peopleabout the DAF. During the 1991 Midyear SurveyTraining, a session was conducted which provided DAFaverages and comparisons between June and Decemberexpansions. This awareness may have made adifference since the decrease in the DAF in 1992 wasabout one half what it was in 1991.

Based on these results, I recommend continued,enhanced training with each state to examine theirunique problems and further reduce the DAF dilemma.This education could be done during the advanced mid-

59

BmLIOGRAPHY

Dillard, Dave. (1993) "Design Specifications toStandardize Survey Procedures." National AgriculturalStatistics Service, United States Department ofAgricultural, February 1993.

Jones, Ned. (1988) ·QAS Substitution Analysis."Estimates Division, National Agricultural StatisticsService, United States Department of Agricultural,NASS Staff Report Number 5MB-88-04.

Page 64: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

UNBIASED ESTIMATION IN THE PRESENCE OF FRAME DUPLICATION

Orrin Musser. National Agricultural Statistics Service. USDAResearch Division. Room 305, 3251 Old Lee Hwy, Fairfax. VA 22030

INTRODUCTION

Survey organizations which conduct surveys on anongoing basis devote much effort and expense to themaintenance of their sampling frame. Estimation ofpopulation parameters may suffer from two main typesof frame deficiency: incomplete population coverageand duplication. In this paper we will focus on theproblem of duplication. with emphasis on thecomputation of correct inclusion probabilities as ameans to achieve unbiased estimation.

Duplication in the sampling frame is a seriousproblem which undermines the assumption of knowninclusion probabilities for each population element. Fora large sampling frame. while it may be too costly todetermine all duplication in the frame, it may bereasonable to assume that for a given populationelement it may be possible to determine all duplicatesin the frame. If so. then for many sampling designs.unbiased estimation is possible.

A sampling frame is a device which associates acollection or list of sampling units with a fmitepopulation of elements. It is helpful to formallydescribe the relationship between the sampling frameand the population. Suppose we have a population U ={EI' ~, ... ,EI:. •••• EN}' a collection of N elements Ek anda sampling frame F = {FI' F2•...• F~ ... ,F~, acollection of M sampling units Fl' For each unit Fj andeach element ~, let the indicator variable Oik bedefined:

And let Mk = E i Oill: be the count of frame units whichrepresent or ·link to· population element k. We willcall the collection or set of frame units linked to·population element k. link-group k. Duplication existsin the frame when there are some population elementswhich are linked to more than one frame unit. that isMk > 1 for some k. We assume that while Mk'S areunknown (difficult and lor expensive to determine for alarge entire population) Mk can be determined exactlyfor a particular unit k and thus for a sample. Assumea simple random sample of size m without replacement.We will denote this sample of frame units by s. For

60

each sampled unit. we obtain a listing of all frame unitsthat link to it. This set of frame units represents asingle unique population element k. Since othermembers of this linkage group could have beensampled. it is possible to sample a population elementk more than once. If we think of our sampling assampling without replacement of population elements.we also obtain a sample sp which contains n( ~ m)distinct population elements. While each frame unit inour original sample s had ~ual probability of selection,each population element in our sample sp did not have~ual probability of selection, and thus estimators whichassume we are sampling population elements with ~ualprobability wilI be biased if there is duplication in theframe.

2. CORRECT INCLUSION PROBABILITIES

We know that the Horwitz-Thompson estimator:

is an unbiased estimator of the population total. Y.Thus if we can compute 1fk for each sampled element k.we can get unbiased estimates for population size andin general for any variable of interest, even in thepresence of duplication. The inclusion probabilities. 1fk•are straightforward to calculate if we know Mk for eachsampled unit. If we are interested in the probability ofselection for population unit k. or equivalently linkagegroup k. of size Mk' we use the fact that this is~uivalent to 1 minus the probability of selecting asample of size m from the frame such that no units oflinkage group k were selected. This is just the ratio ofthe number of possible samples of size m chosen fromthe (M - MJ frame units which do not incJude anymember of linkage group k over the number of possiblesamples:

Page 65: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Example: Ifwe take a sample ofm=5 frame unit froma frame of size M= 100 in which there is duplication,what is the probability that our sample Sp of distinctpopulation elements contains a particular population unitk for which Mk = 1, 2?

For Mk = l(Population element k is represented onlyonce on the frame):

95!511001

This is true in general, for each population unit forwhich there is no duplication, i.e. Mk= 1, theprobability of selection is just mfM, or f, to be denotedas 'lr*. (Recall that since we may sample a givenpopulation element more than once, n, the number ofdistinct population elements sampled, is a randomvariable and 11"k ¢ nlN.)

Mk = 2: (Population element k is represented twice inthe frame)

tt=1_(958)

k (1~0)=1- 981 95151

93151 1001=1-~ 94

100 99=.0979797

Note that this probability is not double the selectionprobability for a population unit without duplication. Itis interesting to look at this ratio of selectionprobabilities in genera\. Looking at the general formulafor the selection probability when Mk=2. we mayexpress 1I"k approximately as a function of the samplingfraction, f = m/M:

61

(M-m) !mlMl

Thus the ratio rk = 1I".j7r* where Mk = 2, and 11"* is theprobability of selection for any population element forwhich there is no duplication, may be expressed:

Thus. as the sampling fraction approaches 0, rk

approaches two. As f gets large and approaches 1, r k

approaches 1. Thus as the likelihood of selection getssmaller. the bias due to incorrect assumptions of equalprobabilities is increased. This ratio rk is interestingbecause it expresses the degree to which the data Yk is"over expanded" due to the assumption of known equalselection probability. In our example where f = .05,the "pi estimator" would over expand Yk by a factor of.09797/.05 = 1.96. If N were 10(f= 1/2), then rk willbe approximately 1.5 and thus estimates which ignoreduplication will over-expand data for elements with oneduplicate by a factor of 1.5. If N were 10,000 then rk

would be essentially 2.Since we assume that we may determine Mk for any

population element k, then clearly we may compute 1I"k

for each sampled element and thus use the Horwitz-Thompson estimator to obtain unbiased estimates forpopulation totals and means. If we defme Yk= 1 foreach population element k, then we could obtain anunbiased estimate of N, the true population size. Thisis just the sum of the reciprocals of the inclusionprobabilities for the n distinct population elements in Sp.

This estimator is unbiased for N:

Page 66: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

The variance of the Horvitz-Thompson estimator of apopulation total Y is given by

and an unbiased estimator of the variance is given by

(Samdal, Swensson, and Wretman, section 2.8).These general formulae are very useful for this situationwhere duplication results in unequal selectionprobabilities.

The second order inclusion probabilities needed forthese formulae may be determined for the sample bythe following formula which uses analogous reasoningto that for the first order inclusion probabilities.

62

3. ALTERNATIVE STRATEGIES

One alternative "adjustment" for list duplication, andone that is currently used by NASS surveys, is thecommon survey practice of using a weight or dataadjustment factor to account for the effect ofduplication. If a population element k appears on thesampling frame Mk times, then when sampled the datais multiplied by 1/Mk• Even if the same populationelement appears multiple times in the sample, everysampled unit reports. Cox (1993) describes thisprocedure as an adjustment of the weight "associatedwith sampled frame units to reflect the multipleselection opportunities for the desired population unit. "This adjustment obtained by multiplying the samplingweight M/m, and the adjustment 11M\:, results in anoverall weight of M/(m*MJ. This new weight is not,in general, equal to the reciprocal of the probability ofselection. As sho'Ml earlier the ratio rk depends on thesampling fraction. Nonetheless, this procedure doesresult in unbiased estimation.

Suppose x is a data item with Xk being the data foreach true population element k. Then for each frameunit I which is linked to element k, we define Yld =X/Mk , I = 1 .. Mk. Thus we are letting each frameunit account for the proportion, I/Mk, of the data forpopulation element k. Clearly the total of the y's isequal to the total of the x's:

Thus a reasonable estimate for X would be

A •• MY=E -Yi .j-l m

Page 67: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Note again that this sum is over the entire sample offrame units. This is clearly unbiased for X, since thisapproach is equivalent to a simple random sample withthe frame being the population. Thus Y is unbiasedfor Y = X.

This technique really obtains unbiased estimation byredefining the relationship of the frame to thepopulation. If a population element appears M. timeson the frame, then each of those Mk records accountsonly for the proportion 11M. of the data for populationelement k. This eliminates the duplication of the data.

Another approach used to obtain unbiased estimationin the presence of frame duplication is to define a·unique counting rule' which links each populationelement to a single frame unit. An example would be to

63

link each population element k to the frame unit in Mk

with the largest frame id, etc. In this case, populationelement k is sampled only if this particular frame unitis selected_ Thus, if duplication were detected afterdata collection, there could be loss of data.

REFERENCESCox, B. (1993), ·Weighting Class Adjustments for

Nonresponse in Integrated Surveys: Framework forHog Estirn:ltion,· NASS Research Report SRB-93-03.

Lessler, J., Kalsbeek, W. (1992) Nonsampling Error inSurveys, New York: John Wiley.

Samdal, C., Swensson, B., Wretman, J. (1992),Model Assisted Sun'ey Sampling, New York:Springer- Verlag.

Page 68: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

- --~ ~---

ESTIMATING AGRICULTURAL LABOR TOTALS FROM AN INCOMPLETE LIST FRAME

D. Scot Rumburg, Charles R. Perry, Raj S. Chhikara', and William C. Iwig, USDA/NASSCharles R. Perry, Research Division, 3251 Old Lee Hwy., Room 305, Fairfax, VA 22030

KEYWORDS: Post-stratification, Multiple Frame, ratioestimate

ABSTRACTThe National Agricultural Statistics Service previouslyhas conducted monthly labor surveys to estimate thenumber of total agricultural laborers. It employs amultiple-frame approach, using both a list and areaframe. The list frame is highly efficient in sampling thetarget population of agricultural operations but does nothave complete coverage of that population. The areaframe covers all agricultural operations but is relativelyinefficient in sampling those operations. An approachutilizing population count estimates from an initial areasample and post-stratified estimates from the monthlylist sample has been investigated as a method forimproving the precision of the survey estimate whilereducing area frame respondent burden. Preliminaryresults indicate that survey to survey ratios of post-stratified Iist-only estimates can produce estimateswhich are comparable to current multiple frameestimates in both level and variance.

1. INTRODUCTIONA multiple frame approach, employing both a list andan area frame, has long been a cornerstone for manyof the agricultura,l surveys which are conducted bythe National Agricultural Statistics Service (NASS).Area frame responses often account for a majority ofthe total variance for multiple frame estimates butonly a small part of the total indication. For thisreason and others, it was recommended that a studybe initiated to investigate alternatives to the currentmultiple-frame approach for administering surveys.A post-stratification approach whereby listrespondents could be used to represent the entiretarget population, was recommended forconsideration (Vogel, 1990a, 1990b and 1991). Kott(1990a and 1990b) elaborated on the proposal andoutlined the two model-based estimators, theirvariance and potential bias. Perry, et al. (1993)provide an estimation method for the variance of ageneralized post-stratification estimator based on itslinear approximation using a Taylor Series expansion.Survey data from the California Agricultural LaborSurvey series from July 1991 through June 1992were used to investigate the alternative estimators.

Raj S. Chhikara is Professor, Division of Computing andMathematics, University of Houston· Clear Lake,Houston, Texas 77058

64

2. METHODOWGY2.1 NASS Survey MethodologyNJ\SS conducts nl1D;1c:roussurveys with regard toagncultural commoditIes and related subjects. Themajority of these surveys employ a multiple-frame(MF) methodology using both a list frame and anarea frame. The list frame is stratified based onknown data about agricultural operations with regardto the survey item(s) of interest. The list frame isnot a complete listing of all agricultural operations.For the 1992 survey year beginning in June, theentire NASS list frame is estimated to contain 56 % ofall agricultural operations (often referred to simply asfarms) and 81 % of all land in farms. The area frameis stratified based on the agricultural intensity of aregion. Unlike the list frame it has completecoverage of all agricultural operations in the U.S.

All reporting units (agricultural operations) in theJune area survey (JAS:A) are classified as eitheroverlap (OL) or as non-overlap (NOL) with the listframe. All operations found to be NOL are dividedinto several sampling pools to be used in follow-onsurveys for the year. The list frame takes precedenceover all OL operations when a MF estimate iscalculated. A MF estimate is obtained by summingthe list frame sample component estimate with thearea frame's NOL sample component estimate. Inmost cases, the list frame provides about 75 % of thetotal MF estimate while the NOL component addsonly the remllinine 25%. However, the NOLestimate is often a major contributor to the overallvariance of the MF estimate, due to both the highvariability of sampled units for many commoditiesand the sizable sample weights associated with smallsampling fractions. The post-stratification approachinvestigated in this paper is an attempt to improve thereliability of the NOL component of MF estimates.

2.2 Post-Stratification MethodologyThe proposed list-only estimator based on modelingof the NOL population represents a departure fromthe present NASS survey design and estimationmethodology. Three factors motivate use of list onlyestimators: (1) the NOL sample units are highlyburdened, (2) the current NOL estimates are often

Page 69: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

unreliable, and (3) the presence of NOL sample unitsincreases the complexity of a survey.

Post-stratification for the Agricultural Labor Survey(ALS) was based on three classification variables:(1) The peak number of agricultural workers anoperation expected to have over the course of a year(Peak), (2) the annual farm value of sales foragricultural goods (FVS), and (3) the type of farmoperation (Ffype). These classification variableswere selected based on their ability to describedistinct post-stratum populations and to correlate withthe number of hired agricultural workers, which isthe variable of interest. Basic strategy to obtainhomogeneous post-strata populations involvedselecting class boundary values for the two numericalclassification variables (Peak and FVS), and creatingcombinations of the third categorical variable(Ffype). No more than twelve total post-strata couldbe created in order to maintain adequate samplecounts for all post-strata across all surveys.Depending on cutoff values and FType groupsselected, fewer post-strata could be constructed. Anattempt was made to maintain a minimum of 20respondents per post-stratum for all post-strata,though this was not always possible.

2.2.1 Post-Stratified EstimatorsPost-stratification is often used as a variancereduction tool in a design unbiased survey. It canalso compensate for the undercoverage of a targetpopulation by a plU1icular selected sample. Both usesare employed for the approach explored in this paper.First it is hoped more homogeneous populations areproduced with post-strata, resulting in variancereduction. Second, the list frame is used exclusivelyas the selected sample for follow-on surveys,resulting in undercoverage (actually non-coverage) ofthe NOL.

Once selected, the list sample is post-stratified toobtain post -stratum estimates. In the case ofunweighted list responses, the estimator of thecharacteristic of interest Y is of the form:

(Eq.l)

wherebk IJ," ktll post-stratum population size

estimate [rom the June survey (JAS:A),nk" ktb post-stratum sample size, andUk" the set of all useable sample reporting

units in the ktb post-stratum.

wherewJ •• itll sample reporting unit weight., andotheI variables are defined as in Equation J 65

For each choice of 9P5 one can compute a ratio andratio expansion based on a combined ratio.

3. RESULTS3.1 Preliminary Research - Simulation StudiesSimulation studies provided a theoretical perspectiveon several aspects of the post-stratificationmethodology, Approximate variance estimates werederived and evaluated. These numerical evaluationsshowed that the performance of a post-stratifiedestimator is largely a function of the sample size usedto estimate the post-stratum sizes, the sample sizeused to estimate the post-stratum means of thevariable of interest, and the ratio of these two samplesizes. The relative efficiency of the post-stratifiedestimators all increased as the ratio of the two samplesizes increased. Given the sample size for thefollow-on survey, the sample size for the base surveyshould be at least twice as large for gains inefficiency. Moreover, for post-stratification to beeffective, the entire sample size in the follow-onsurvey should be at least 50 (preferably much larger)with the sample size in all post-strata at least 10(preferably '20 or more).

3.2 Comparison of List and NOL RespondentsTable I shows that the NOL has a lower averageestimate within nearly all post-strata for California,whether one compares weighted or unweightedresponses, Particularly troubling are the large FVSpost-strata with open-ended peak workers (5 or more)and specifically the fruit, nut and vegetable post-stratum. The few NOL respondents which fell intothis category had many fewer hired workers than didtheir list counterparts. The high FVS post-stratum,with Peak 5+ and Ffype Crop & Mise produced alarger NOL average hired workers than did the listand was due to one large NOL respondent reporting391 hired workers.

Table 1 also characterizes the difference betweenweighted and unweighted averages. Unweightedaverages are consistently higher than weightedaverages for both list and NOL respondents fornearly all post-strata. Since operations with largernumbers of hired workers are sampled at a higherrate, and because operations with larger numbers ofworkers tend to represent fewer number of farms, thesampling weights are negatively correlated with thenumber of hired workers, the variable of interest.This situation occurs even within post-strata. Thenegative cOlTelation of weights and number of hiredworkers within post-strata suggests that theunweighted average will tend to overestimate thenumber of hired workers per farm for both frames.

Page 70: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

TABLE 1. Counts and Mean Number of Hired Workers Within Post-StrataFor the California July 1991 Agriculture Labor Survey

SurveyPost-strataDefinitions CountsFVS FT~ Peak lIst NOlS1-50K Crops&Misc 0-4 49 70S1-50K Crops&Misc 5+ 1 1S1·50K Veg,Frt&Nut 0-4 70 79S1-50K Ve~,Frt&Nut 5+ 28 15S1-50K Dalry,Poultry,GrnHse&Nursry0-4 4 0S1-50K Dairy,Poultry,GrnHse&Nursry5+ 0 0Crops&Misc 0-4 56 35Crops&Misc 5+ 59 17Veg,Frt&Nut 0-4 57 15V~,Frt&Nut 5+ 249 38Dalry,Poultry,GrnHse&Nursry0-4 30 0Dairy,Poultry,GrnHse&Nursry5+ 63 5 22.90 16.30 33.70 19.40Cell counts and means for the weighted and unweighted response values byframe. Note that the NOL cell averages tend to be smaller than the list averagesand that the weighted cell averages tend to be smaller than the unweightedaverages.

IleightedMeanlIst0.280.000.212.031.36

UnweightedMeanlIs t NOl0.24 0.240.00 0.000.17 0.136.89 0.400.75

S50K+S50K+S50K+S50K+S50K+S50K+

1.117.900.7015.301.15

NOl0.120.000.110.27

0.3915.100.485.29

0.9313.800.7738.801.30

0.6935.100.6716.30

3.3 Overall Performance of the Estimators3.3.1 Post-Stratified EstimatorsThe combinations provided by selecting unweightedor weighted averages and an ability to select for list-only, NOL-only or both respondent types, producedsix possible post-stratification estimators to study andevaluate. The NOL-only estimators were used onlyin co~unction with list-only estimators to providecomparative diff~nces between the two frames ona state level basis. ·The MF post-stratified estimatorswere used to evaluate changes in variance due to list-only post-stratification.

Not surprisingly, it was found that the unweightedestimator consistently overestimated the actual laborforce by a large margin (recall Table 1). Theestimators using unweighted survey values producedthe largest biases of all the estimators. Use ofweighted survey values produced adequate, thoughsomewhat more variable, estimates when comparedto MF survey design direct expansion (MF DE)estimates. Since much of the post-stratificationinformation is included in the list survey design (FVSand FType) and because the bulk of the ALS estimate· .comes from the list, it is not surprising the weightedMF post-stratified and the MF DE estimates arecomparable.

Figure 1 depicts the level of bias produced by usinga strictly unweighted post-stratified estimator andcompares survey estimates across the 1991 ALSseries year. For this and all succeeding graphs ofthis type, the vertical length of each estimaterepresents one standard error from the survey

66

estimate in either direction. In some extreme casesthe length in one or both directions has beentruncated.

Also not surprising, given the post-stratum meandifferences as shown in Table 1, it was found that thelist-only estimator consistently overestimated theactual number of laborers while the NOL-onlyunderestimated the actual labor force number. Figure2 illustrates graphically the problems inherent in theweighted list and NOL-only post-stratified estimators,again comparing survey estimates to the Agricultural

Figure 1.MUtipIe Frare W8\tdOO vs. \.kMeirjlted Posl-SlraIifia:j EstinaIa"

( CaiICXTia 3 -Waf 0assificaIi0n )700.000

H 600.000

IR 500.000

I I] IEo 400.000

W Io 300.000R

~ ~ ~ ., -" ~ ~~ 200.000 ~R 1li "1\S 100.000 ~

- IlQO.RO Ilf" P.S. (l-'/WTD) B If" D( f ur P.S. (WTD)

IJl.It AL(;91 SEPil OCT91 NCN91 IXCI\ J,lH92 f£892 YAlt92 APR92 UA'.'~.i12 A1HS12

(Vertical SymbOlLe"'Jlh Rep<•• enls TwoStandard Err",,)

Con,rparisonof Weighted versus Unweighted Post-StratifiedEstiinators.

Page 71: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Statistics Board as well as to the combined MFestimate across the 1991 ALS year.

Overall, list-only post-stratification CVs forCalifornia were mostly comparable with original MFDE CVs. This occurs for the most part because list-only post-stratified estimates generally are larger thanthe survey indication and have more varianceintroduced through the use of estimated Junepopulation counts. This leaves the overall percentageerror of the total (CV) roughly equal to the MF DECV. One must remember however, that thecomputed variance underestimates actual variance byas much as 10% resulting in a CV increase ofapproximately 5% since the v"f:T = 1.049. Forpurposes of this report however, all CVs displayedwill be the actual value computed with nocompensation for bias. For California, the averageCV for the weighted list-only post-stratificationestimate for the survey year 1991 averaged 15.6%.This compares to an average MF DE CV forCalifornia of 14.2%.

Figure 2.Usl-CJrt.j w. NOL-CXtf .m M~ Fnme R:lSt - SlralIied EslirIIIcI

( Caibria - 3-Wa{ ~::>I;) I )

shown in Figure 3 alongside the actual MF DE andthe Board number for that month. The weighted list-only post-srratified survey total ratio tracks well withthe Board estimate and, in fact, seven of the elevenratio expansion estimates obtained for California werecloser to the Board estimate than the MF DEindication. The average CV for California was11.3% for the list-only ratio expansion estimatewhich was less than the MF DE average cv of14.6% over the same eleven surveys.

Figure 3.Us! - Oriy vs. MIiI1jje Frarre Post - Stratified Ratio Estimator

( Ca/ibnia 3- ~ 0assilicaIi0n).lOO.OOO

H,RE 200.000owoRIC 100.000ERS

- 9()j,R() lUSH S RAno B IoE[)( f UfP.s. RAllO

~1~I~t~'~1_2~~~2~2m2

-1lOIoRll IUSTP.s. BIEP.s tHOLP.5.

Comparison of the Iisl-Dnly, NOI.AJnly and Multiple FramePosl:Stra/ifieil Estinuztors.

ClJftM liI..Re IIJNIK (RI01JEl) 10f'H!,OJS Cl.MlBI)

Comparison of the Lisl-Dnly and Multiple Frame Pos/-Stratified RatiOEstimaJors.

3.3.2.2 Survey Design List-Only RatioFigure 4 compares the three-way post-stratified list-only combined ratio expansion with the survey designlist-only combined ratio expansion. The post-stratified ratio estimator uses a weighted ratio,accounting for differences in farm numbers acrosspost-strata. It is this difference which makes thepost -stratified combined ratio estimator a moreaccurate estimator than the survey design combined

H,-AEo JIXl,COO

WoA 2OO,COOICER 100 DClOS .

3.3.2 Ratio EstimatorsPost-stratified combined ratio expansion estimateswere calculated using MF and list-only data. Inaddition, a combined survey design ratio expansionestimate was computed using list-only data. Elevenmonthly estimates were produced over the surveyyear for each estimator since a ratio estimate for July1991 was not feasible. The ratio estimators wereproduced using only matched useab1e reports fromboth surveys.

3.3.2.1 Post-Stratified RatioRatio expansion estimates were obtained using acombined post-stratified ratio estimator and the three-way post-strata classification scheme. Post-stratifiedsurvey total estimates using either list-only or MFrespondents were constructed. and the results are

Figure 4.F1:JsI- Stra!fsd Usl-Crit w. &IvIft Usl- CWt RaOO Esli'r1alD'

(CSlbria' - 3-Wa{ ~)400.000

H

~ JOO.OOl)

EoW200000oRK

~ 100,000

S

- ,l(Wlt) I UST ,.S RAllO B •••. oc \ UST 0( RAm--,.- , I l • t I , I I I

AUC;I stPil OCllll ...,...., O(CIl ..wGI2 1'[812 MilM2 ~ ••.•."2 "Il"lNQ2

QJftNI !lIMY IoI:lN1H (Am:Bl1O Pf£WUl ClM1EA )

Comparison of List-Only Post-Stratified and List-Dnly SurveyDeslgn Ratio Estimators.

67

Page 72: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

ratio estimator. The two estimators have about thesame precision.

3.4 Summary of ResultsThe combined three-way post-stratified list-only ratioestimator seemed to provide a viable estimate for thetotal number of hired workers indication. Thoughthe post-stratification model is somewhat complexand would have to be optimized for each state orregion, it does fulfill the objective of using a samplewhich ignores a subgroup, sp~ifically the NOL.

4. CONCLUSIONSFor the Agricultural Labor Survey, there appear tobe differences in mean values of list and NOLrespondents within post-strata. Also, the sampledesign produces negative correlations between thesample weight and the response within post-strata.These two factors make the unweighted post-stratifiedestimator biased. Though bias is reduced in the caseof the weighted post-stratified estimator, differencesbetween weighted list and NOL respondents still existwithin post-strata. Ratio expansion estimators,however, appear to avoid these problems and mayhave potential within the NASS framework.

The list-only combined ratio expansion estimatorusing three-way post-stratification appears to modelthe NOL adequately, while reducing variances onaverage. However, development of post-strata forindividual states and regions would be a timeconsuming job and. would involve reworking of thecurrent survey summary system. Additionally, anestimator that uses only list respondents will probablybe biased and must be cautiously approached andmonitored if any list-only estimator were to becomeoperational.

One problem with the post-stratified estimatorsinvestigated here is the estimated farm counts fromthe JAS:A. These counts are estimated using thearea weighted estimator 'and tend to be quite variable.The inaccuracies can be corrected to some degree byusing the MF population estimate. Any variability inthe counts translates to higher overall variances of thepost-stratified estimates. The post-stratified ratio.estimators reduce the magnitude of this problem, butmore accurate population estimates would surely helpthese estimators also.

5. REFERENCESFlores-Cervantes, Ismael; 1990a; SimulationResults: A Preliminary Study of the "Strawman"Estimator and Derived Formulas; November 21,1991; NASS Internal Correspondence.

68

Flores-Cervantes, Ismael; 1990b; Effect of anIncomplete List on the Design Based "Strawman"Estimator; November 26, 1991; NASS InternalCorrespondence.

Flores-Cervantes, Ismael; 1990c; Estimating theNumber of Farms in a Post-Stratum using theJune Agricultural Survey; December 2, 1991;NASS Internal Correspondence.

Kott, Phil; 1990a; Some Mathematical Commentson Modified "Strawman" Estimators; NASSInternal Discussion Paper.

Kott, Phil; 1990b; Comparing "Strawman" and"Weighted Tract" Estimators in June; NASSInternal Discussion Paper.

Nealon, John Patrick; Review of the Multiple andArea Frame Estimators; SF&SRB Staff Report No.80; Wash., D.C. March 1984.

Perry, Charles R., Chhikara, Raj S., Deng, Lih-Yuan, Iwig, William C., and Rumburg, D. Scot;1993; Generalized Post-Stratification Estimatorsin the Agricultural Labor Survey; SRB ResearchReport NumberSRB-93-04, Washington, D.C., July1993.

Rumburg, D. Scot, Perry, Charles R., Chhikara, RajS., and lwig. William C.; 1993; Analysis of aGeneralized Post-Stratification Approach for theAgricultural Labor Survey; SRB Research ReportNumber SRB-93-05; Washington, D.C., July 1993.

NASS Program Planning Committee, 1992; Minutesof the Program Planning Committee of theNational Agricultural Statistics Service; July 14-16,1992.

Turner, Cheryl L.; Evaluation of EstimationOptions for the Monthly Farm Labor Survey; SRBResearch Report Number SRB-91-12; Washington.,D.C., December 1991.

Vogel, Frederic A; 1990a; "Strawman" Proposalfor Multiple Frame Sampling; October 3, 1990;NASS Internal Memo.

Vogel, Frederic A; 1990b; "Strawman" Proposal;November 20, 1990; NASS Internal Memo.

Vogel, Frederic A; 1991; A New Approach toMultiple Frame Sampling and Estimation; January16, 1991; NASS Internal Discussion Paper.

Page 73: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

AN ANALYSIS OF THE SAMPLING FRAME FORTHE CHEMICAL USE AND FARM FINANCE SURVEY

Susan Cowles and Susan Hicks, USDAINASSSusan Cowles, 200 N. High, Room 608, Columbus, OH 43215

The National Agricultural Statistics Service (NASS)has begun a pilot study of the Chemical Use and FarmFinance Survey (CUFFS). The CUFFS combines parts ofForm H of the Objective Yield Survey and the Cost ofProduction Survey (COPS) versions of the Farm Cost andReturns Survey (FCRS). Most NASS surveys utilize amultiple frame design - a combination of list and areaframes. The purpose of this analysis is to evaluate whetherthe multiple frame design is necessary for CUFFS. If not,then the sample will be selected from the list frame only.

1. Abstract widespread public concern for farming's effect on waterquality and the environment.

III. Why Select a List Only Sample for CUFFS?

The list frame consists of known farm operatorswhile the area frame consists of all land segments. Thearea frame is a complete frame and thus is used tomeasure undercoverage in the list frame. Farm operatorsfound in the area frame that are not represented on thelist comprise the NonOverlap Sample or NOL. Thereare three major advantages to a list only sample:

3) reduction in variances.

2) cost savings, and

I) reduction in respondent burden for the NOL,

LO=List Only MF=Multiple FrameAt the U.S. level, the NOL contributes about 15%

to total planted acres for major commodities, butcontributes about 40% to the total variance.

JunePlanted Acres

(in Mil)NOL List MF

12.1 68.9 81.09.6 52.0 61.62.4 16.6 19.0

CV%LO MF

.6 .7

.9 .91.7 1.9

Commoditv NOLCorn 3.1Soybeans 3.4Spring 2.4

Wheat

Respondent burden is a major advantage of a listonly sample. The NOL domain is relatively small dueto small area frame sample sizes and more complete listframes. However, the relatively small population ofNOL operators must be spread across all surveys, withthe result that some NOL operators must be interviewedfor multiple surveys.

The cost savings due to a list only sample are smallcompared to total survey costs. However, for lesscommon commodities the NOL produces few if anypositive operations. Thus, the cost per positive record ishigh. If the NOL domain is included, this cost could bereduced by screening operations by telephone for thecommodity of interest prior to interview. See table 1.Table 1

improve response rates.

The data that CUFFS would collect is currentlycollected through two other surveys: the Objective YieldCropping Practices Survey (Form H) and the Farm Costsand Returns Survey's Cost of Production Survey (FCRS-COPS). When, or if CUFFS becomes operational, itwould totally replace Form H and the COPS questionnairewould be shortened for crops targetted by CUFFS. Theshorter interviews for Objective Yield and FCRS-COPSwill reduce respondent burden for these two surveys.

Data quality is expected to improve for the COPSversion of CUFFS as a result of collecting informationclosely following harvest. Currently, the COPS versionsof the FCRS collects data six months after harvest. Abetter response rate is expected for the economic data dueto its association with chemical use· data. Farmers aremore willing to provide data on chemical use due to

improve data quality, and

II. Overview of the Chemical Use and Farm FinanceSurvey (CUFFS)

reduce respondent burden,

The CUFFS design consists of three phases. In thefir phase, operations are contacted to determine if theyha\ ~ the commodity of interest. In the second phase,pesticide and fertilizer use information is collected in theFall from operations that reported having the commodity.Finally, those same operations are recontacted thefollowing Spring to obtain economic data

The CUFFS was .c;levelopedby NASS in an effort to:

69

Page 74: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

For most commodities. the CV for a list only samplewould be smaller than the multiple frame CV. However,the decrease in variance comes at a cost and that cost isbias. A list only sample introduces an inherent bias intothe estimate by excluding some members of the populationfrom the sample universe. However. if farm operators inthe NOL domain are similar to farm operators on the listframe, then the bias may be minimal.

IV. Analvsis Study to Compare List vs NOL Estimates

The goal of the research was to compare chemicaluse data from the list and NOL domains to see if therewere significant differences. Ideally we would like tocompare list to NOL estimates from the CUFFSquestionnaire. However, the CUFFS pilot survey, whichwas conducted in Minnesota. used the proposed list onlysample design. thus the NOL component was notavailable. To obtain a proxy for CUFFS chemical usedata. we obtained Minnesota Form H data for com,soybeans and spring wheat. Form H data is area frameonly. The data was divided into overlap (OL) andnonoverlap (NOL) domains to allow comparisons betweenthe two domains.

The OL and NOL domains were determined byclassifying operations as OL or NOL to FCRS for 1991.The OL to FCRS group was further divided into groupsdetermined by whether they were in a strata being sampledfor CUFFS. If an operation was OL to FCRS and in aCUFFS strata, it was OL to CUFFS. All others wereconsidered NOL to ct.fFFS.

The Form H summary system was used to obtain themean rate of application per treatment and mean percentof acres treated for each active ingredient by domain. Wethen compared the estimates obtained between OL andNOL domains for the twelve most commoncommodity/chemical combinations.

sample design was more complicated than a SRS, but theeffect of this approximation should be a slightoverestimate of the variance, which we were willing toaccept.

Mean rate of application is estimated as:A Zd

Rd = =-Yd

where:Zd = average rate of applil:ationfor each commodity,

chemical combination in domain d

Yd = average number of treatmentsfor eachcommodity/chemical combination in domain d

For mean rate of application per treatment, wecalculated bootstrap-t confidence intervals instead of theusual t-test because of concerns about normality of thestatistic being tested. Using the bootstrap methodologywe constructed histograms of the distribution of thestatistic mean rate of application per treatment. Thehistograms suggested severe departures from normalityfor some statistics. See Rao and Wu (1988).

We selected 10,000 bootstrap samples from thecombined sample - OL and NOL domains combined.For each bootstrap sample we calculated the usual t-statistic for a difference. Then based on the distributionof the t-statistics we estimated the 5th and 95th percentilesof the t-distribution for each commodity/chemicalcombination. The confidence interval is defined as:

{ D - t.9so(D).D - t.oso(D) }

where:D RoL - RHoL - - from fuU sample

o(D) standard error of difference

V. Methodology t.os' t.9S percentiles of the bootstrap tdistribution

Percent acres treated is estimated as:A ndp =-

d Ud

where:d = OL or NOL domain

nd = number of positive responses in domain d

Ud = number of usable responses in domain d

The variances were calculated using the usualformulas for the variance of a proportion when the dataare obtained by a simple random sample. In fact, the

70

VI. Results for Mean Rates of Application perTreatment

Table 2 shows the mean rates of application pertreatment, the normal t-test for the difference and thebootstrap-t confidence interval.

The normal t-tests are included for comparisonpurposes. The bootstrap confidence interval reflects theskewness in the distributions of the differences while the

Page 75: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

OL NOL Normal BootstrapActive Rate per CV(R) Rate per CV(R) test CI

Commoditv Inqredient Treatment (%-l Treatment (%l t (diffl LL ULCorn Nitrogen 67.25 2.50 63.07 5.09 1.151 -2.05 9.90

Dicamba 0.32 2.79 0.25 5.21 4.598* 0.05 0.10*Atrazine 0.80 4.35 0.82 12.23 -0.215 -0.23 0.14Alachlor 2.24 3.70 2.63 12.18 -1.174 -0.91 0.20Metolachlor 2.14 3.18 2.31 5.18 -1.243 -0.38 0.07

Soybeans Trifluralin 0.77 3.15 0.81 5.09 -0.955 -0.13 0.03Imazethapyr 0.05 2.06 0.06 2.30 -1.820* -0.01 0.00Alachlor 2.60 3.26 2.54 6.89 0.310 -0.24 0.42Bentazon 0.69 5.44 0.76 8.10 -1.071 -0.19 0.06

Spring MCPA 0.29 4.91 0.30 7.76 -0.403 -0.07 0.03Wheat 2,4-D 0.26 16.78 0.31 14.45 -0.761 -0.15 0.06

-

Table 2

Bromoxynl1 0.24 5.09 0.19 18.97 1.148 0.01 0.21

Bromoxynll 37.1 21.2 1.866*

normal t-test relies on the asswned bell-shapeddistribution. Therefore, the bootstrap-t confidence intervalmore accurately reflects the true differences between theOL and NOL groups.

Using the normal t-test, two significant differenceswould have been found: Dicamba used on com andImazethapyr applied to soybeans. Their respective t-valuesare 4.598 and -1.820 which, in absolute value, are greaterthan the 1.645 critical value for a 90% confidence test.The bootstrap intervals show only one clear differencebetween the OL and NOL mean rates of application pertreatment of Oicamba on com. However, we could expectto fmd one or two significant differences, by chance, evenif no true difference exists based on a 90% confidenceinterval. We conclude that the data do not suggest adifference in mean rate! of application between the twodomains.

Table 3

PercentAcres

Active TreatedCommodity Inqredient OL NOL tCorn Nitrogen 97.0 98.3 -0.905

Dicamba 30.8 26.7 0.898Atrazine 32.3 29.3 0.644Alachlor 25.8 19.0 1.654Metolach .. 25.0 26.7 -0.382

Soybeans Triflur .. 46.6 37.0 1.959Imazeth .. 55.5 51.3 0.848Alachlor 10.0 12.3 -0.735Bentazon 12.4 10.4 0.647

Spring MCPA 67.6 51.5 1.639Wheat 2,4-D 27.6 51.5 -2.455

*

*

*

VIT. Results for Percent Acres Treated

Table 3 shows the results for a difference betweenthe OL and NOL domains for percent acres treated.

Of the twelve commodity/chemical combinations tested,four showed a significant difference. Also, the differencefor spring wheat treated with MCPA had a t-statistic of1.639 which is quite near the critical value of 1.645. Theabsolute difference for MCPA was 16.1 percentage points.If this difference is considered significant, all three of thespring wheat/chemical combinations show significantdifferences. Alachlor applied to com and Trifluralinapplied to soybeans also showed significant differences.

As with rate of application per treatment, based on a90% confidence interval one or two significant differencescould be expected, by chance, when no true differenceexists. However, because at least four significantdifferences were found, we conclude there appear to bedifferences in percent of acres treated between the OL andNOL domains.

71

IV. Conclusions

We looked at rate per treatment and percent acrestreated for twelve commodity/chemical combinations.For rate of application per treatment, the data do notshow a consistent statistical difference. However,several differences were found between OL and NOLpercent acres treated. While the data suggest somedifferences exist, we are reluctant to draw conclusionsfor the nation as a whole based on results for one state,for the following reasons:

Cropping practices vary by state.

Commodities vary by state.

Applications of chemicals vary by commodity.

The next phase of the research will examine 1992Form H data from Minnesota and Louisiana to determinewhether these results are consistent over time and acrossstates. We recommend delaying the decision about

Page 76: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

whether or not to proceed with a list only sample forCUFFS until that research is complete.

References

Cowles, Susan and Hicks, Susan (forthcoming), "AnAnalysis of the Sampling Frame for the ChemicalUse and Farm Finance Survey," NASS Staff Report.

Rao, J.N.K. and Wu, C.FJ., (1988) "Resampling Inferencewith Complex Survey Data," JASA, 83,231-241.

U.S. Department of Agriculture (1983): Scope andMethods of Statistical Reporting Service, PublicationNo.l308, Washington, D.C.

, .

72

Page 77: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

AGRICULTURE DATA SYSfEMS - A V.S./CANADA COMPARISON

Robin O. RoarkUSDA/NASS, International Programs Office

So. Agriculture Bldg Room 4132; Washington, D.C. 20250-2000

Key Words: NASS and AgDiv

The CanadalU.S. Free Trade Agreement hasopened the border for more agricultural trade betweenCanada and the United States. It will also increase theneed for agricultural data and comparison of statisticsfrom each country. The Agriculture Division (AgDiv)of Statistics Canada (STC) provides a wide array ofagriculture statistics for Canada just as the NationalAgricultural Statistics Service (NASS) provides for theUnited States. However, procedures for sampling, datacoIlection, analysis, and compiling data can be quitedifferent. Even the structure of the agricultureindustry, the structure of the two governments and ofthe two agencies plays a role in how data are coIlected.summarized, and published.

NASS, the statistical agency for the U.S.Department of Agriculture (USDA). is responsible foragriculture production and inventory statistics and someof the economic statistics. Economic Research Service,another economic agency within USDA, is responsiblefor compiling the' ~t8.tistical data into farm income andeconomic projections. World Agriculture OutlookBoard, also one of the USDA economic agencies, usesthe U.S. agriculture statistics from NASS, along withdata from other countries, to estimate the worldagriculture supply and demand.

The Agriculture Division of Statistics Canadais responsible for agriculture production, inventory andeconomic statistics and also compiles data for farmincome and economic projections. Some economic dataanalysis work for outlook projections is done inconjunction with Agriculture C.anada. AgricultureCanada is the agriculture policy ministry (department)of the Canadian Federal Government.

The primary difference in the structure ofNASS and AgDiv is that AgDiv is part of a centralizedstatistical system known as Statistics Canada. WithinSTC there are several program divisions, includingAgDiv, that have the responsibility of preparingstatistical data related to their division. Other divisionshave specific supporting functions to all STC program

73

divisions, such as research, survey design, computerprogramming, dissemination. etc. The de-<:entralizedU.S. statisti.::al system has several statistical agencies.Each statistical agency is responsible for a particulararea of data but must also support their own research.survey deSIgn. programming, dissemination, etc.NASS is the statistical agency for agriculture withinUSDA.

Both organizations have field offices tofacilitate data collection, but the functions of theseoffices are different. For NASS, State StatisticalOffices (SSO) are responsible for list maintenance, datacollection, data entry, summarization, analysis andadministrative work. The SSO's are a significant partof the NASS structure and have a great deal of inputinto the analysis and estimation of the data. Theyreceive guidance and support from the main office inWashington, D.C. and concentrate primarily onagriculture related statistics. The State offices alsohave the free.dom to be involved in state funded projectsthat are not part of the National program.

For AgDiv, data coIlection is done at theRegional Offices (RO). The ROs are part of STC andare primarily data collection and report disseminationcenters. The ROs are not directly tied to AgDiv andthey collect all types of statistical data. They generallydo not get involved in the analysis. summarization orestimation of the data. However, the AgDiv does haveagreements with each of the Provincial Governments.These agreements state that the Provincial StatisticianswiIl review the data and estimates prior to thepublication of the data. These statisticians are aIlowedsome input into the level of the published estimates.

Both organizations use sales of agriculturalproducts, without regard for acreage, as the definingfactor for establishing an operation as a farm.However, the cut-off for the sales value is different.The definition of a farm for the U.S. is any operationthat has $1,000.00 in agriculture sales or expectedsales. For Canada the definition of a farm is anyoperation that produces agricultural product(s) for sale.

Page 78: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Census of Agriculture

The Agriculture Census for Canada IS

conducted every five years by AgDiv. Theenumeration coincides with the Census of Population.Therefore, a question is asked on the population censusquestionnaire about farming interests. If the responseis positive, then an Agriculture Census questionnaire isfilled out by the respondent. The questionnaires aredelivered by a STC enumerator and are mailed back.The enumerators do follow-up of the non-response.Analysis of data gives an expected under-coverage ofabout 1.5 % to 3 %, depending on the estimate.

Agriculture Census data are reviewed byAgDiv analysts at the provincial, and sub-provinciallevel, concentrating on the top contributors with mostof the manual editing done on a macro-level. Only ifsevere problems are detected, or in the review ofextremely large operators, are individual recordsreviewed by analysts. After Census data have beenreviewed and published, AgDiv completes a 5 yearhistoric review of all acreage, production, inventory,and economic estimates. Generally, AgDiv estimates,for the year of the census, are revised to match Ag.Census estimates, with minor adjustments due todifferences in reference dates. The impact of the undercoverage or duplication is assumed to be negligible.

The U.S.'Census of Agriculture is conductedevery 5 years by the Agriculture Division of the CensusBureau, part of the Department of Commerce. TheU.S. Agriculture Census is a stand alone collection, nottied to the U.S. Population Census. The U.S.Agriculture Census is a mail out survey with telephonefollow-up of the non-response. The AgricultureDivision of the Census Bureau maintains a list of farms.The list is updated from responses to their surveys andfrom outside sources, such as NASS, income taxrecords, etc. Under-coverage from the AgricultureCensus is about 13 % of the farms. Under coveragefrom the Census is primarily with the smaller farms.The Agriculture Census uses the NASS area frame toestimate potential under coverage. However, Census··published numbers are totals from the survey and arenot adjusted for the under coverage. Duplication alsocauses some problems, especiaIly with producercontract arrangements. Census data are reviewed byNASS statisticians at the county, State, and Nationallevel. Like AgDiv, data are reviewed on a macro-level with review of individual records limited to severeproblems and extremely large operators. After Census

74

data have been reviewed and published, NASScompletes a 5 year historic review of all acreage,production, and inventory estimates. However,estimates from the Agriculture Census are not used asofficial NASS estimates. NASS makes adjustments toCensus data to account for duplication, under coverageand differences in reference dates.

The linkage between the Canadian Census ofAgriculture and the Population Census, allows for moreCensus estimates of the social characteristics ofagriculture. However, the Agriculture Division of theU.S. Bureau of Census conducts foIlow-on surveys toestablish estimates for most of the same statistics.

Sampling Frames

List - The farm register (list frame) for AgDiv is basedprimarily on names received during the Ag. Census.The Ag. Census is used as the basis for the list framefor a 5-year period. The majority of updates are basedon changes found during surveys conducted during the5 year period. The list frame, for most probabilitysurveys, is frozen between Census years. The samplesfor these surveys are selected shortly after the currentCensus. New names are not generaIly added to theframe, except names of operators that are new toagriculture and take over an existing operation areaIlowed to replace an existing name. Some surveys,such as the fruit and vegetable survey, do use producerorganization lists to update new names in betweenCensus occasions. These samples are re-drawn everyyear. Coverage at the time of the Census is estimatedto be about 97 % for most samples, but this percentagedrops at the rate of about 1-2 % per year after theCensus, depending on the commodity being measured.

The list frame for NASS is continually updatedand samples are redrawn every year. Since updates tocontrol data are based on information received duringsurveys and on information from producer organizationsand government program participation lists, etc., not allnames and control data are updated each year. NASSdoes not receive names or control data from the U.S.Agriculture Census. The NASS list frame was builtmany years ago from outside organization lists.Coverage runs at about 55 % for the number of allfarms, but varies significantly by State. Coverage isconcentrated on larger farms with coverage of farmland at about 80 %.

Area - The AgDiv area frame is designed to produce aweighted segment indicator to be used in conjunction

Page 79: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

with list frame surveys. When screening is done, theprimary data collected are name and addressinformation, total acres, acres in the segment (a pieceof land with identifiable boundaries used as a samplingunit in area-frame sampling), and some generalinformation about the type of farm. Agriculturaloperations are then determined to be either overlap(included on the list frame) or non-overlap (not includedon the list frame). The non-overlap operations are thenincluded in subsequent multi-frame surveys. Virtuallyno data indicators are produced from the area framealone.

The NASS area frame survey is designed toproduce closed segment indicators, open segmentindicators and weighted segment indicators. Indicatorsfrom the June Area Frame Survey are used both asindependent indications and also used in conjunctionwith list surveys to produce multi-frame indications.Area screening includes collection of tract (the area ofland located within a segment that is under a singleoperating arrangement) data for all agricultureoperations found in the area frame sampled segmentsand entire farm data for all operations that are notknown to be overlap with the list frame. Non-overlapoperations are also included in Agriculture Surveyprogram for the rest of the survey year. The NASSarea frame plays a much more significant role in theestimation program for NASS since the list coverage islower.

AgDiv has begun using telephone enumerationto identify area frame operators in some areas of thePrairie Provinces (Alberta, Saskatchewan, andManitoba). Segments in this region are drawn to followthe range and township boundaries. Since only limiteddata are collected at the time of the screening,preliminary results have been very favorable. Due tothe complexity of segment boundaries in the otherregions, the segments are personally enumerated. AllNASS segments are personally enumerated due to boththe precise segment boundaries and the amount and typeof data that must be collected. The design of theAgDiv area frame specifically excludes areas notconsidered to be involved in agriculture, such asrangeland and urban areas. The NASS design includesthese areas but samples them at a proportionally lowerrate.

Sample Design

Both organizations use probability sampledesigns on all major surveys. Crop estimates for

75

AgDiv are based on a series of surveys called the CropPanel surveys. The Crop Panel surveys begin with aSeeding Intentions and Grain Stocks Survey in earlyApril. The panel surveys continue with the JuneAcreage Survey, July 31 Yield and Grain StocksSurvey, September 15 Yield Survey, NovemberAcreage & Production Survey, and December 31Production & Stocks Survey.

Crop estimates for NASS are based on a seriesof integrated surveys called the Agriculture Surveyprogram and the monthly Agriculture Yield Surveys.The March 1 Agriculture Survey is used to estimateseeding intentions. The June 1 Agriculture Survey isused to establish the planted acres and preliminaryharvested acres. The September 1 Agriculture Surveyis the end-of-season indicator for production of smallgrains and the December I Agriculture Survey is theend-of-season indicator for production of other fieldcrops and hay. The Agriculture Surveys are also usedto collect quarterly on-farm grain storage data. Themonthly Agriculture Yield surveys collect yield andproduction data on crops during the growing season.The crops included in the survey will vary from monthto month depending on the growing season of each cropand the program for that crop. The first small grainyield survey is conducted in May and the first rowcroplhay yield survey is in August.

The actual sample design is fairly similar forthe two organizations with both designs stratified byacreage of cropland. The NASS sample is stratified byState on cropland and grain storage, with some stratafor specialty crops such as tobacco. The AgDiv CropPanel is stratified on cropland by sub-provincialregIOns.

The Agriculture Survey sample for NASS isalso stratified to collect quarterly hog & pig inventoryand farrowing data. Reference dates for hog inventoryestimates are the same as the dates for the fourAgriculture Surveys. The cattle & sheep estimatesreference dates are January I for both cattle and sheepand July I for cattle only. Therefore, cattle and sheepdata are collected via a separate survey with separatestratification and sampling. The AgDiv livestocksurveys are done twice each year and include cattle,hogs, and sheep. The reference dates for the livestocksurveys are January 1 and July 1.

There are also numerous probability and non-probability surveys conducted by both organizations toobtain statistics on commodities such as fruit,vegetables, poultry, prices paid and received by

Page 80: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

farmers, farm income and expenses, etc. Due to thestructure of the farm programs and marketing boards,there are more administrative data available to theAgDiv than are available to NASS. The supplymanaged commodities, which are milk, eggs andpoultry meat, have extensive administrative dataavailable that are used by the AgDiv in lieu of surveydata. Farm expense data for AgDiv are obtainedthrough a sample of income tax records rather than anadditional survey of farmers. Both organizations useadministrative data whenever available to help relieverespondent burden and lower data collection costs.

Data Analysis

NASS questionnaires that are not collectedusing Computer Assisted Telephone Interview (CATI)procedures are put through a complete manual reviewprior to data entry. Discrepancies are reviewed andcorrected. Within AgDiv, the manual editing of dataprior to data entry is virtually non-existent. Data thatare not collected with CATI are briefly reviewed by thedata entry division for clarity. However, data are notedited as being "correct" or "incorrect".

Computer editing of data, after collection, isused in both organizations. NASS's computer editingis designed to review data and flag errors with only asmall number of the errors corrected by the editingprogram. The re~ining errors are then reviewed andcorrected by statisticians. The computer editingprogram for AgDiv is designed to make corrections orperform imputations for most of the errors. Statisticianreview and correction of the remaining micro-levelerrors is not as prevalent.

Expansion (or raising) factor adjustment isused by both organizations in most probability surveysto account for missing data and refusals. Some surveysin both organizations use automated imputationprocedures. Response rates for the production surveysare very similar between the two agencies.

Estimation

The estimation program in NASS requires thatmost data and estimates be reviewed, analyzed, andapproved by the Agriculture Statistics Board (ASB).The ASB is made up of the ASB Chairperson, theEstimates Division Director, the respective commoditystatistician(s) and their Branch Chief, and 1 or morestatisticians from I or more SSOs. For selected

76

estimates, the Secretary of Agriculture, or arepresentative from the Secretaries office is briefedabout the estimates prior to the release of the data.Within AgDiv, generally only the statisticians (Federaland Provincial statisticians) and their immediatesupervisor review the data and estimates prior to therelease.

The concept of livestock inventory estimatesare nearly identical. The weight groups, age groups,and livestock definitions are virtually the same.However, the reference dates and estimation proceduresare different. AgDiv produces sheep inventoryestimates twice a year while NASS produces theseestimates only for January 1 each year. Bothorganizations produce January 1 and July 1 cattleinventory estimates. For hog estimates, the referencedates are off by one month. NASS's reference datesare December 1, March I, June I, and September Iwith AgDiv dates being January I, April I, July I, andOctober 1. NASS conducts a survey for each of the 4quarterly estimates while AgDiv makes the estimatesfor April I and October I without the use of a survey.The inventory estimates are based on administrativedata and previous survey data.

Both organizations, in spite of the differencesin structure, make use of market trends and informationprovided by field experts. The primary estimation toolfor both organizations is the balance sheet. AgDivestimates are made so the balance sheet residual is zerowhere NASS will generally allow small residuals toremalD.

Crop estimates for seeding intentions and forpreliminary yield surveys have a different base conceptbetween NASS and AgDiv. The seeding intentionsestimates from NASS are designed to forecast what theactual planted acres will be, thus requiring thestatistician to make a forecast of the seeded acres. ForAgDiv, seeding intentions are designed to be a pointestimate showing current seeding plans of farmers,without making a projection of what planted acres willactually be.

The same concept holds true for yield surveys.With NASS, data analysis done with the monthly yieldsurveys is designed to forecast the final yield andproduction of the each commodity. NASS usesobjective yield surveys for the wheat (winter, spring &durum) , com, soybeans, cotton and potatoes. Theobjective yield surveys are conducted in the majorproducing states and usually account for over 80% ofthe production. The objective yield models are

Page 81: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

designed to compare current conditions with historicconditions and compare to final yields. For AgDiv,data analysis is designed to estimate current yields. withno adjustments made for historic trends or comparisonsof survey indications to final yields. The only objectiveyield survey for AgDiv is for potato production.

Conclusion

Most of the differences mentioned above haveadvantages and disadvantages when compared together.Despite the differences in structures and externalcontrols, both organizations produce a wide array ofagriculture statistics that are used to establishagriculture policy for the respective countries.However, the background and even the definition ofwhat the data represent are often quite different.Therefore, when comparing the data from eachorganization, the data concepts are equally as importantas the numbers themselves.

77

Page 82: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

ANALYSIS OF RESPONSE BIAS IN THE JANUARY 1992 CATTLE ON FEEDREINTERVIEW PILOT STUDY AND THE JULy 1992

CATTLE ON FEED REINTERVIEW SURVEY

Robert Hood, USDAINASSResearch Division/3251 Old Lee Hwy., Fairfax, VA 22030

Abstract. To assess the accuracy of reported cattle onfeed (COF) inventory, the National AgriculturalStatistics Service (NASS) developed a series ofreinterview surveys to study response bias and toidentify specific reasons for reporting errors in order toimprove the survey instruments, training and estimationfor COF inventory. A three-phase plan, including apilot study in January 1992, a semi-operational surveyin July 1992 and a fully operational survey in January1993, was designed to meet these objectives. Thispaper discusses the results of the January 1992 and July1992 COF reinterview studies.

For each study, a subsample of respondents reportingfor the parent survey was recontacted for face-to-facereinterviews in which a subset of the original questionswas re-asked. Differences between the reinterviewresponse and the original parent survey response werereconciled to determine a final "proxy to the truevalue", which was used to measure response bias.

Although no bias estimates were possible for theJanuary pilot study: useful cognitive information wascollected. For J'uly, response bias estimates weregenerated for several survey items. Althoughdifferences were observed between the reinterviewresponses and the original parent survey responses, thenet bias was not significantly different from zero fortotal COF inventory. The contribution to the bias dueto reasons for differences between the responses wasalso examined to detect any underlying relationships.

I. INTRODUCTIONOver the years, the National Agricultural StatisticsService (NASS) has conducted a variety of reinterviewsurveys to evaluate the quality of its AgriculturalSurveys (AS). The purpose of these reinterview,surveys has been to study response bias (as opposed toresponse variance) and to determine reasons forreporting errors. To assess the accuracy of reportedcattle on feed (COF) inventories, a new series ofreinterview surveys were developed to study responsebias. Specific reasons for reporting errors wereobtained to guide efforts to improve the surveyinstruments, training and estimation for cattle on feed

78

inventory. The main focus of this reinterview programwas cattle on feed reporting by smaller farmer-feederoperations, as opposed to larger commercial feedlots.A three-phase plan was designed to implement areinterview program for COF at NASS. This planincluded a one-state pilot study in January 1992, a two-state semi-operational survey in July 1992 and a fullyoperational five-state survey in January 1993. Thispaper discusses the setup and results of the first twosteps.

In estimating response bias, a "proxy to the true value"must first be obtained. In this study, as in previousreinterview studies at NASS, the reconciled value wasconsidered to be the "true" or final value. Considerablecost and effort was expended to ensure that the valueobtained during reconciliation was the best proxy to thetrue value, as reinterviews were done face-to-face andconducted by supervisory and experienced enumerators.When the original and reinterview responses differed,the enumerators were instructed to determine the"correct" response during the reconciliation process. Ifthere was no difference, i.e. the same response wasgiven during both interviews, this common responsewas considered the final value. If the respondent couldnot determine which response was correct, or if adifference was not reconciled by the enumerator, thefinal value was missing and the observation was notused for that item. If the respondent indicated thateither response could be correct, then the average of thetwo responses was used as the final value. A thirdresponse, different from both the original andreinterview responses, was also possible if thereinterview respondent said that neither the original northe reinterview response was correct.

The formulas used to calculate response bias andvariance estimates were based on a stratified randomsample design. For the ilh observation in stratum h,response bias was measured as: Bhi = 0hi - Fhi forstratum h = 1, .... ,L and unit i = 1, .... ,1\, where

Ohi = original responseFhi = final or reconciled value.

A negative bias indicates underreporting of a surveyitem, whereas a positive bias indicates overreporting.

Page 83: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

n. REINTERVIEW PROCEDURESFor both January and July 1992, a subsample ofrespondents reporting for the respective parentAgricultural Survey was recontacted by supervisory andexperienced enumerators for face-to-face reinterviews.To get the most accurate data possible, enumeratorswere instructed to contact the person mostknowledgeable about the operation, even if that personwas not the same as the parent survey respondent.Reinterviews were to be conducted within ten days ofthe initial survey in order to minimize recall bias.

Responses to the parent survey were provided to theenumerators in a sealed envelope on a reconciliationform. The reconciliation form contained the questionsthat appeared on both the parent survey and thereinterview survey; the parent survey responses; andspaces to record the reinterview response, thereconciled "correct" response, and a written explanationin the event that a difference between responsesoccurred. To maintain the independence between thetwo responses, the envelopes containing the originalparent survey responses were not to be opened untilafter the reinterview was completed. Having twoindependent responses and asking the respondent toresolve any discrepancies enabled us to obtain the bestpossible data.

Immediatel y after conducting the reinterview, theenumerator would open up the reconciliation form andexplain to the respondent that he/she had theinformation obtained from the initial survey and wouldlike to compare the responses for the few items thatappeared on both interviews. Each difference (nomatter how small) would then be reconciled to obtainthe "correct" response, and a written explanation forwhy the difference occurred would be recorded on thereconciliation form.

The reinterview questionnaire, used to collect a secondindependent response for comparison to the originalresponse, was similar to but shorter than the parentsurvey for both January and July. Reinterviewquestionnaires for January and July were almostidentical. Questions that were common to both theparent survey and reinterview survey included questionspertaining to basic operation description, total cattle onfeed inventory and total cattle inventory. Somequestions were shortened by dropping "include" and/or"exclude" phrases, while others were reworded in orderto ensure that the reinterview/reconciliation processobtained the best "proxy to truth". If a cognitiveproblem exists with the current operational wording of

79

a particular question, then simply re-asking the questionthe same way may not uncover an underlying responsebias. Since questionnaire wording was to be studied,enumerators were instructed to ask the reinterviewquestions exactly as worded on the questionnaire. Thereinterview questionnaires for both January and Julycontained additional "cognitive" questions as well as asection on terminology (in which the respondent wasasked to give hislher definition of some terms currentlybeing used in our surveys) to be used in evaluatingsurvey definitions and concepts, as well asquestionnaire wording. "Probing" questions were alsoasked to determine if all cattle on feed were beingreported and being reported accurately.

II. Januarv 1992 Pilot StudyIn January 1992, a reinterview pilot study wasconducted 10 Iowa during the NASS JanuaryAgricultural Survey. The objective of this study was towork out the logistics of conducting a reinterviewsurvey for cattle on feed and to field test thereinterview and reconciliation forms. A small non-random subsample of respondents to the JanuaryAgricultural Survey who were initially contacted byComputer Assisted Telephone Interviewing (CA TI)were selected for face-to-face reinterviews. Thesubsample was concentrated roughly within a hundredmile radius of the State Statistical Office located inDes Moines. Samples eligible for reinterview werethose that reported positive cattle on feed capacity onthe initial CATl interview. Of the thirty-two completedreinterviews, twenty-six reported both positive cattle onfeed capacity and cattle on feed inventory, while sixreported positive capacity but no inventory.

Although no response bias estimates or other statisticswere possible for this small non-random sample, thelogistics of conducting a reinterview for cattle on feedwere worked out and information on the problems ofreporting cattle and cattle on feed data were obtained.Some general results from the January pilot study arelisted below.

• Cattle were often misclassified. The reference toheifers in three of the six breakdowns seemed toconfuse the CATI respondent as to which categoryshould be used, often resulting in some animals beingcounted twice.

• Collecting data by phone can be difficult, especiallywhen a question contains multiple categories, such asthe cattle breakdowns consisting of six possible

Page 84: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

categories. The respondent cannot see all the possiblechoices at one time, thus he does not know what hisoptions are and may include animals in one categorythat should be included in a later category. Severalrespondents said they would not have had to adjust theirnumbers as often if they had known all the choicesbeforehand.

• Placing animals into the correct weight categorieswas difficult for both CAT! and reinterviewrespondents. There was a lot of guessing as to whetheror not cattle were over 500 pounds. Animals less than500 pounds are considered to be calves by NASS.

• Total cattle inventories were often misreported dueto incorrect classification of animals and by theplacement of animals into more than one category.

• There was great variability in the definition of a calfamong the respondents for this survey. Somerespondents used weight as a criterion, while othersspecified age.

• Reported feedlot capacity for cattle on feed probablyindicates the maximum number an operation could everhold, not the maximum number that would nonnally befed for the slaughter market.

1lI. Julv 1992 Reinterview SurveyThe July 1992 Cattle on Feed Reinterview Survey wasdesigned as a semi-operational survey to facilitate thetransition from a research activity to an operationalprogram in January 1993. The primary objectives wereto provide real-time response bias estimates for agencyuse, to expand the domain of samples eligible forreinterview beyond CATI, and to continue collectingcognitive information to improve both the reinterviewand operational survey instruments.

Reinterviews were conducted on a subsample of theJuly Agricultural Survey (AS) respondents originallycontacted by CATI in Iowa and Minnesota. A smallsubsample of non-CA TI respondents were also selectedfor reinterviews in Iowa. Making non-CA TI sampleseligible for reinterview was an innovation forreinterview studies at NASS. The non-CATI domainwas included because it continues to represent asignificant amount of our AS data collection,particularly during the January AS for which thereinterview program is designed. A stratified randomsample with stratum sampling rates similar to the parentsurvey stratum rates was allocated for reinterview.There was a total of 440 samples selected for

80

reinterview, with 220 in each state. Of these, onlycompleted parent survey samples, including those codedout-of-business, were eligible for reinterview. Parentsurvey refusals and inaccessibles were ineligible forreinterview. Out of the 440 units selected forreinterview, 303 units were eligible for reinterview and266 had both usable reinterview and parent survey data.The reinterview non-response rate (for the 303 eligibleunits) was only 9.2%.

For July, response bias estimates for total cattle on feedand total cattle and calves were generated at both thestate and the two-state combined levels. Response biasestimates were calculated for original response minusfinal response and for edited data minus final response.Original and edited data produced similar results withrespect to statistical significance for the two states.Response bias estimates for edited minus final valuesare shown in Table 1. No significant response bias wasdetected for total cattle or cattle on feed at either level.There was wide variability in the response biasestimates in both magnitude and direction (i.e., positiveor negative) between the two states for total cattle onfeed. Iowa reporting showed negative biases of 2.8 %compared to positive biases of 13.4% for Minnesota.Although no significant response bias was detected,differences between the initial and reinterview surveysdid occur. Nearly half (48%) of the responses differedbetween the two surveys for total cattle and about onequarter (24 %) of the responses differed for total cattleon feed. The differences simply tended to cancel eachother out.

The precision of the bias estimates was very low, asindicated by the large standard errors, relative to thebias estimates. The small sample size was not the onlyfactor influencing the bias estimates and the significancetests. The actual number of non-zero differencesplayed an important role also. Although there were 266usable observations overall, the actual number ofdifferences was far less for each item. There were 52non-zero differences for cattle on feed and 112 for totalcattle and calves. These few differences were spreadover 10 strata in Iowa and 8 strata in Minnesota. Withsuch a structure, the small number of non-zerodifferences, the large number of zero differences, andthe large expansion factors resulted in extremevariances which resulted in low precision for theresponse bias estimates. This lack of precision ofresponse bias estimates is a problem that continues toplague us with reinterview surveys. Work continues onsample design and estimation improvements to increaseour response bias estimation precision.

Page 85: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Table 1. Response Bias Estimates for the July 1992 COF Reinterview Survey.

Edited Value - Final ValueStandard

Item/State Bias % of Edited Error 95 % CI

TOT AL COF

Iowa -25,912 -2.8 3.7 (-10.0,4.4)

Minnesota 60,117 13.4 10.5 (-7.3,34.0)

Total 34,205 2.5 4.0 (-5.3, 10.3)

TOTAL CATTLE

Iowa -74,411 -1. 8 2.8 (-7.4,3.7)

Minnesota -48,080 -1.9 2.4 (-6.7,2.9)

Total -122,491 -1.9 2.0 (-5.8,2.0)

IV. REASONSOne of the goals of the July reinterview survey was toidentify the reasons for discrepancies between theoriginal and reinterview responses in order to evaluatethe questionnaires and to detennine how much of thebias may be fixable. During the reconciliation process,explanations were recorded by enumerators for eachdifference that occurred between an original andreinterview response. These reasons were then groupedinto three general categories. "estimation or rounding","definition or interpretation" and "other" (i.e., reasonsthat could not be attributed to the first two categories).In general, differences due to "definitional" reasons canbe viewed as being potentially tlxable by changes in thesurvey instruments, procedures or training. Differencesdue to "estimation" or "other" reasons probably are notas correctable, if correctable at all.

Since response biases can be positive or negative andtherefore cancel each other out, using the net bias couldbe misleading when analyzing biases. Therefore, the

absolute value of each non-zero difference wasexpanded to obtain the total absolute response error foreach reason category. Table 2 shows the frequency ofdifferences by reason category and the percentage of thetotal absolute response error attributable to eachcategory. While "estimation" reasons accounted for38.5 % and 20.5 % of the differences for COF and totalcattle, respectively, these reasons contributed the leastto the total absolute response error (8.6 % for COF and4.9% for total cattle). "Definitional" reasons wereresponsible for the majority of the total absoluteresponse error for COF, accounting for 66.4 % of thebias, while "other" reasons, responsible for 60.7% ofthe bias, contributed the most for total cattle. Table 2shows that there is opportunity for improvement in thesurvey procedures, instructions and questionnaires.Recall that reasons due to "defmitional" problems areconsidered tlxable. "Detlnitional" reasons accountedfor almost two-thirds of the total absolute responseerror for COF and over one-third for total cattle.

Table 2. Percentage of Total Absolute Response Error by Reason Category for Original Minus ReconciledValues. Frequencies of Response Errors are Shown in Parenthesis.

Reason Category

Item Estimation Definition Other Total

Total Cattle on Feed 8.6% (20) 66.4 % (l9) 25.0% (13) 100% (52)

Total Cattle & Calves 4.9% (23) 34.4% (19) 60.7% (70) 100% (112)

81

Page 86: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Table 3. Frequency Table of Relative Bias by Reason Category (Two States Combined).!

Reason Category

Estimation Definition Other

Item/Relative Bias2 # of Obs % of Bias # of Obs % of Bias # of Obs % of Bias

Total Cattle on Feed

Bias =s;;20% 17 (85 %) 3 (16%) 5 (38 %)

Bias > 20% 3 (15%) 16 (84%) 8 (62 %)

Total 20 (100%) 19 (100%) 13 (100%)

Total Cattle & Calves

Bias =s;;20% 21 (91 %) 9 (47%) 52 (74%)

Bias > 20% 2 ( 9%) 10 (53%) 18 (26 %)

Total 23 (100%) 19 (100%) 70 (100%)

I Includes only observations with a bias2 Relative Bias = 100 * (Original value - Reconciled value)/Reconciled value

In order to study the relationship between the magnitudeof the bias and the reason categories, a relative(percentage) bias was calculated for each observationwith a non-zero difference between the original valueand reconciled values. Two levels of relative bias wereused - less than or equal to 20 % in magnitude andgreater than 20% in magnitude. Table 3 shows therelationship between the magnitude of the relative biasand the reason categories. The results indicate thatthere is a significant relationship between the magnitudeof the relative bias and the reason categories."Estimation" reasons tended to be associated withsmaller biases for both items. "Definition" reasonswere associated with larger biases for cattle on feed butwere more evenly distributed for total cattle. "Other"reasons were associated with larger biases for cattle onfeed but with smaller biases for total cattle.

A Closer Look at Total Cattle on FeedThe primary focus of this series of reinterview surveys(i.e., tbe January 1992, July 1992 and January 1993surveys) was cattle on feed inventory. The reinterviewprogram for cattle on feed grew out of the concern thatinventories were being overreported in the farm feederstates. Thus, the results of the July 1992 reinterviewstudy may have been somewhat surprising. Nostatistically significant bias at the individual orcombined state levels was detected. In fact, the resultsindicated only a slight overreporting of 2.5 % at thecombined level. Iowa reporting indicated a slight

82

underreporting of 2.8 %. Minnesota overreporting wasestimated 13.8 %, but the variance was large enough forthe result to be insignificant. Do these results thenindicate that there is no problem? Not necessarily!What must be remembered when looking at the resultsfrom July is the sample size was very small. With sucha small sample size (recall that there were only 266usable samples), the results are very volatile andmistakes on just a few reports can have an enormousimpact on the final bias estimate.

Table 2 showed the percent of the total absoluteresponse error accounted for by each of the threereason categories. "Definitional" reasons were themajor contributor, accounting for 66 % of the totalabsolute response error. "Other" reasons wereresponsible for 25% and "estimation" reasons for about9%. The differences attributable to "definitional"reasons are listed below in Table 4. Also shown aretheir individual percent contribution to the "defmitional"absolute error and the number of times each reason wasreported.

For each of the five reported "misunderstandings", thereinterview response was determined to be the correctresponse during reconciliation. The source of thereporting error for these five samples was attributed toeither the initial respondent, the initial enumerator orboth. The same person responded for two of the fivereports. For the five cases of "did not understandquestion" , the reinterview response was also determined

Page 87: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Table 4. Defmitional Reasons Reported for Total Cattle on Feed (fwo States Combined).

% of Definitional Absolute Number of TimesReason for Difference Response Error Reported

Included cattle/calves from another operation 0.5 1

Did not report as of the reference date 4.9 2

Respondent did not figure death loss in total 7.2 2

Respondent did not understand the question 9.6 5Respondent forgot to include some cattle or calves 13.7 4

Misunderstanding between enumerator & respondent 64.1 5

Total 100.0 19

to be the correct response. The source of error wasattributed to the initial respondent in four cases and toboth the initial respondent anu the initial enumerator inthe other case. The same person responded to four ofthese five cases.

For cattle on feed inventory. there was a total of 52non-zero differences between the original andreinterview responses (excluding one outlier); 34 inIowa and 18 in Minnesota. There was variability in thecomposition of the differences between and within thetwo states. Iowa had about four times as many negativedifferences as Minnesota (21 vs. 5). Minnesota hadmore positive differences than negative (13 vs. 5),while the opposite was true for Iowa (21 negative vs.13 positive).

Of the 13 negative differences for Iowa, 4 were due toa "misunderstanding between the enumerator andrespondent", accounting for 46 % of the total negativebias for cattle on feed in Iowa. Two cases in which the"respondent forgot to include some cattle or calves"accounted for almost 21 % of the total negative bias.Eight "estimation" reasons accounted for only 12 % ofthe total negative bias. As for the positive differences,the major contributor was one case in which the"respondent had not made a decision on marketings",accounting for almost half of the total positive bias forIowa.

Whereas the reason "misunderstanding betweenenumerator and respondent" accounted for 46% of thetotal negative bias for Iowa, one difference due to thisreason was responsible for 70 % of the total positivebias in Minnesota. For Minnesota, the five negativedifferences contributed very little to the overall bias. Inall, there were seven "estimation", eight "definitional"and three "other" reasons for Minnesota. Todemonstrate just how volatile the bias estimates were,without the one difference due to a "misunderstanding",the percent bias in Minnesota would have dropped from13.4% to only 3.7 %.

83

In order to reduce response bias and improve datacollection, enumerator training should emphasize thereason why a reinterview is being conducted, why it isimportant to read the questionnaires exactly as wordedand the importance of a positive attitude whenconducting a reinterview. With the relatively smallsample size, data quality is very important. As wasseen in the July 1992 reinterview survey, oneobservation can completely change both the magnitudeand direction of the bias estimates for a survey item, sotaking the time to collect good data must be stressed.

CONCLUSIONAlthough the January and July 1992 reinterview studiesdid not detect any significant overall response bias forcattle on fe.~ and total cattle inventories, usefulinformation on problems associated with reporting cattleon feed, as well as cattle, was obtained. The twostudies showed a substantial number of differencesbetween original and reinterview responses whichresul ted in great variabili ty . However, the di fferenceswere nearly offsetting, resulting in non-significantresponse bias. The results also indicated that there maybe room for improvement in the current surveyprocedures (including questionnaire design andwording) used to collect COF data. "Definition orinterpretation" problems were found to account fornearly two-thirds of the total absolute response error forCOF. This can be looked upon as being both good andbad. It is bad in the sense that so much "definitional"bias indicates that there may be a problem with theoperational survey. However, it is good in the sensethat "definitional" problems are considered more"fixable" than "estimation" or "other" problems. In ourefforts to reduce response bias and to improve thesurvey instruments, high priority ought to be given toreducing the errors attributed to "definitional" reasons.

Page 88: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

AN EVALUATION OF ROBUST ESTIMATION TECHNIQUESFOR IMPROVING ESTIMATES OF TOTAL HOGS

Susan Hicks and Matt Fetter, USDAINASSSusan Hicks, Research Division, 3251 Old Lee Hwy., Room 305, Fairfax, VA 22030

I. Abstract

Outliers are a recurring problem in agriculturalsurveys. While the best approach is to attack outliersin the design stage, eradicating sources of outliers ifpossible, large scale surveys are often designed tomeet multiple, conflicting needs. Thus the surveypractitioner is often faced with outliers in theestimation stage. Winsorization at an order statisticand Winsorization at a cutoff are two procedures fordealing with outliers. The purpose of this paper is toevaluate the efficiency, in terms of true MSE, ofWinsorization for improving estimates of total hogs atthe state level and to evaluate the efficiency of a data-driven technique for determining the optimal cutoff.

KEY WORDS: Outlier, Winsorization, MinimumEstimated MSE Trimming

II. Design of the Ouarterly Agricultural Surveys

The National Agricultural Statistics Service (NASS)of the U.S. Department of Agriculture (USDA)provides quarterly estimates of total hogs at the stateand national level through its Quarterly AgriculturalSurveys (QAS). The QAS uses both a list and an areaframe in a multiple frame (MF) approach to provideestimates for a variety of commodities in addition tohogs. Known farm operators are included on the list.

The area frame sampling is based on land usestratification. All land in the contiguous 48 states hasa positive probability of selection in the area frame.Thus, the area frame is a complete frame and can beused to measure undercoverage in the list frame. Tractoperations found in the area sample are matched to thelist frame. Operators not on the list comprise the NonOverlap sample, or NOL.

III. Why are Estimates of Total Hogs so Variable?

Providing reliable estimates of total hogs through themultiple frame approach has always been difficult.The sampling variability in hog estimates is closelyassociated with the sampling variability in the NOL.While the NOL typically contributes only 25% to the

84

total estimate, its contribution to the total variance isaround 75%.

Outliers in the NOL can severely distort theestimates. Rumburg (1992) studied the causes andcharacteristics of NOL outlier records in five states.He cited three major contributors to outliers in theNOL:

• increased weights due to subsampling,

• the transitory and variable nature of hogproduction, and

• the location of hog operations on land with littleagriculture.

The area frame stratification, based on land usestrata, is more efficient for field crops than forlivestock items, which tend to be less correlated withland use. Basically the variability in the NOL domain,which is a subset of the area frame, can be attributedto two factors:

1) the population within each strata is highlyskewed to the right, and

2) the sample size is small.

IV. What is a Hog Outlier?

Most of the literature on truncation estimators forsurvey sampling describes its application to theproblem of variability in weights. For householdsurveys, where we're frequently estimating Bernoullicharacteristics, outliers are indeed caused by extremesampling weights. For agricultural surveys extremeobservations are caused by a combination of moderateto large weights and moderate to large values.

Lee (1991) addressed this problem by differentiatingbetween outliers defmed by classical statistics andinfluential observations. In classical statistics, outliersare unweighted values situated far away from the bulkof the data Influential observations are valid reportedvalues that may have a large influence on the estimate.Influential observations may involve outliers, but morefrequently are a combination of relatively large

Page 89: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

sampling weights and relatively large data values. Forour purposes the term outlier will refer to influentialweighted survey values, not unweighted values.

V. Current Procedures for Handling Outliers

n

Yt = Lad) w; y.j=1 J

where:t = truncation level

(1)

Wj = design weight for jth unit

In this version of the standard truncation estimator,we truncate the weights of those observations whoseweighted value expands larger than t so that theexpanded value now equals t. The truncated portionsare then "smoothed" over all observations.

We also evaluated estimators which adjust for the rlargest values. The form of the estimator is:

adj

(2)

if WjYj~t

if wjYj>t

WJ

n

L~/

WJ

for j=l, ...,n-r

n

Y, ~ L adj w; Yjj=1

,w·

J

Yj = reported hogs for jth unit

where:

1- MulIlpI. From. Est I

2

Effect of Outlierson the Multiple Frame Estimator

0.88806 8903 8912 9009 9106 9203 9212

Ouart.r

Currently each state reviews potential outliers duringthe editing stage. Typically the 20 largest weightedvalues are listed through the Potential Outlier PrintsSystem (POPS). The state commodity statisticiansreview these outputs and questionable data areverified. If the weighted data are correct, no changesare made.

However, the preliminary state hogrecommendations are adjusted for outliers. Figure 1shows the effect of extreme outliers in Georgia. InDecember of 1989, the operation that expanded to .7million hogs comprised approximately 42.4% of thetotal multiple frame estimate. This is exactly the kindof situation we want to correct for.Figure 1

Currently, outliers are adjusted in a somewhat adhoc fashion at the state level. Treatment of outlierscould include truncating the weight to 1.0, truncatingthe weight to some other value, or not truncating theweight at all. Although the effect of outliers iscompensated for in the state recommendation, therecords are never changed. This avoids the potentialfor biasing the national indication. Outliers at thestate level are rarely outliers at the national level.

adj

To evaluate the efficiency of each estimator forimproving estimates of total hogs at the state level wedeveloped a monte carlo simulation.

VII. Description of the Monte Carlo Simulation

VI. Description of Two Winsorization Estimators

We evaluated two types of robust estimators forimproving state level hog indications: Winsorization ata cutoff, t, and Winsorization at r order statistics. Theform of the estimator which adjusts to a fixed cutoffis:

We built our simulation around one state, Georgia.Because of the complexities of the multiple framedesign and because the major source of outliers andsampling variability is from the NOL, we restricted thesimulation to the NOL domain of the area frame.

A positive NOL segment is defmed as any sampledsegment that contains at least one NOL hog operation.Separate fann operations are divided into tracts. The

85

Page 90: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

IX. Minimum Estimated MSE Trimming

ratio of the MSE of the unbiased estimator to the MSEof the new estimator. See Table 1 in next section.

VIII. Evaluation of Winsorization at a Cutoff andWinsorization at an Order Statistic

MSERatio

1.034.993.908

1.018.980.918

123

123

MSE NumberRatio Truncated

l.3961. 440l.402l.17S

.739

l.243l.2991. 321l.236

Follow-on18000150001200090006000

TruncationLevel

June1400012000100008000

For both the June and follow-on samples,Winsorization at a cutoff is more efficient thanWinsorization at an order statistic. Further for smallercutoffs, more samples are truncated and moreobservations are truncated per sample. Thus, thebias in the estimator increases. Figure 3 shows thedecomposition of variance and bias at each level oftrimming evaluated for the June sample size.

In choosing a cutoff for truncation we'd like tominimize the number of samples truncated. Ingeneral, we don't want a cutoff that truncates everysample, but rather a cutoff that corrects for rareextreme observation like that depicted in Figure 1.

Ernst (1980) compared seven estimators of thesample mean which adjust for large observations.Four of the estimators were modifications ofWinsorization at a cutoff, t. The other three estimatorswere modifications of Winsorization at an orderstatistic. Ernst showed that for the optimal t, theestimator which substitutes t for the sample valuesgreater than t has minimum mean squared error.Earlier work by Searls (1966) showed that gains areachieved for wide choices of t when the data originatefrom an exponential distribution. The results from ourmonte carlo study are consistent with those studies.Table 1

In practice, we do not know the underlyingdistribution of the data and likewise the optimal valuefor t. The survey practitioner has to weigh thebenefits of trimming -- decrease in variance -- with the

number of sample segments in the area frame is fixedwhile the number of sample segments in the NOLdomain is variable. For the NOL domain the sampleunit is the segment, but the reporting unit is the tract.To simplify the simulation we modelled the positiveNOL tracts and assumed a fixed NOL sample size.

The tract level weight is a product of the stratumsampling weight and a tract adjustment factor. Theadjustment factor prorates an operation's reportedvalue back to the tract level for operations that onlypartially reside within the sample segment. Based onhistorical June data from '91 and '92 for Georgia, wedeveloped parametric models of the weighted tractlevel hog data for positive NOL tracts.Figure 2

900

OCIJ

700

,..000()

ij500

~ 400-300

200100

o10 70 130 190 = 310 ~ 430

Nurnbar of Hog. - Zeros Excluded

The three models -- one for each strata -- were allgamma density functions. Due to the sparseness of thedata it was difficult to validate the model. However,our main interest was in developing a reasonable,highly skewed distribution rather than developinghighly accurate models for Georgia NOL tracts.

We created a fixed sample universe for each stratumbased on the estimated gamma distribution and theestimated number of positive NOL segments. We alsoestimated the proportion of positive NOL segments tototal NOL segments. With the zero segmentsincluded, the result is a highly skewed population witha large spike at zero and a long right tail. See Figure2.

From the fixed universe, we drew 1000 stratifiedsimple random samples with replacement. Table 1 wascreated using a SAS program based on 1000 samples.Some of the other graphs were created from a Fortranprogram using the same data based on 10,000 samples.To compare the performance of the estimators fordifferent sample sizes we chose a sample of size 360to mimic the June sample and a sample of size 216 tomimic a follow-on sample.

The efficiency of the estimators was estimated as the

Distribution of Hogs per Tractby Stratun

86

Page 91: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Figure 3

Effect of Truncation on Reliabilityof the Estimates

..····MSE···-·- -·-·

::~:L:·:~:::~:=:::::~:·~~~::- ._ _.]:.~:~-.-.::

._m.Bias ..Square.cl ..

X. Evaluation of Minimum Estimated MSETrimming as a Data-driven Estimator

estimators and the validity of the correlationassumption. In general, for a simple design andignoring the effects of editing and nonresponseadjustments, we have unbiased estimators for V(Y).However, obtaining an unbiased estimator for thevariance of a truncation estimator is lessstraightforward. One approximation that is often madeis to estimate the sampling variance of Yt by treatingthe trimmed weights as if they represented theuntrimmed "eights in the usual variance formulae .We have used this approximation in (3).

800014000 12000 10000Fixed Cutoll

900

8SO

800

150..c: 1000;;;

~ 650

600

~SOO

450nooe

costs -- increase in bias.The obvious criterion for evaluating the effect of

different trimming levels is the estimated MSE. Potter(1988) documented a procedure called MinimumEstimated MSE Trimming. Because we do not knowthe true parameter, Y, we are limited to evaluating thebias based on the unbiased estimator Y. The estimateof MSE (Yt) is derived from the relation:

E(yr-f)2 = Var(Yr)+Var(Y)-2Cov(YI'Y)

+[E(Yr)-E(y)f

= MSE(Yr) +Var(¥) --2Cov(Ypy)

where:Yr = the trimmed estimate

Y = the unbiased estimate

Thus, an unbiased estimate of MSE(Y J is:

MSE(Yr)=(Yr- Y)2- V(Y)+2c6V(YpY)

If the correlation between the truncated estimate andthe untruncated estimate is approximately 1.0, then thisreduces to:

MSE(Yr) = (yr-Y)2 - V(y) + 2 [V(Yr)V(Y)]l(2 (3)

In this procedure, the estimated MSE is computedfor various trimming levels and the trimming levelwith the minimum MSE is selected forimplementation. This procedure can be used tosuggest optimal trimming levels in (I) or number ofobservations to trim in (2). While the minimum MSEtechnique should identify the optimal trimming levelover many samples, for any particular sample it couldidentify a trimming level far from the optimum. Thisoccurs because our estimate of MSE is conditional onthe sample we have drawn.

The efficiency of this estimator, in estimating thetrue MSE, depends on the efficiency of the variance

We wanted to evaluate the efficiency of theminimum MSE technique as a data-driven estimator.With this estimator each sample would be truncated atdifferent levels to determine the level that minimized(3). Thus, the cutoff varies from sample to sample.Some preliminary runs showed that the efficiency ofthis data-driven estimator was highly dependent on therange of trimming levels evaluated. Thus, weevaluated this estimator over the set of possibletrimming levels. The minimum trimming levelsranged from 2000 to 20,000 and the maximumtrimming levels ranged from 10,000 to 28,000. Weevaluated this estimator based on 10,000 monte carlosamples. For each monte carlo sample, the MSEestimator in (3) was used to determine the optimalcutoff for that sample for each range of trimminglevels. The minimum and maximum trimming levelswere each incremented by 1000 covering allcombinations within the ranges specified above.

The level of the estimate, YI, that minimized (3) wasretained for t:ach cutoff range and sample. The trueMSE of the estimator for each combination ofminimum and. maximum cutoff was calculated in theusual fashion based on 10,000 Yt estimates. And theMSE ratio is as defmed before.

Recall from Table 1 that the optimal cutoff is around10,000 to 11,000. As Figure 4 shows, this estimatoris most efficient when the range of trimming levelsevaluated is dose to the optimum trimming level. Asthe minimum cutoff is reduced the efficiency of theestimator drops off rather dramatically. Whereasincreasing the maximum cutoff has a minimal effecton the estimator.

For any particular sample the estimated MSEtechnique could identify a trimming level far from theoptimum. This occurs because the estimated MSE isconditional on the sample we have drawn. Thus asFigure 4 shows, the efficiency of this technique as anestimator depends on the range of cutoffs we choose

87

Page 92: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Figure 4

Effect of Changing the Cutoff Rangeon tne Efficiency

of Minimum Estimated MSE Trimming1 240

M5E

1 162

Pat 10

, 08]

Cutoff10000 2000 6S0D

However, if we adopt the fixed cutoff estimator, weneed to be careful about choosing the cutoff.

Minimum Estimated MSE Trimming provides analternative to Winsorization at a cutoff when theoptimal cutoff is unknown as is frequently the case.However, the efficiency of this data-driven estimatoris dependent on how close the range of cutoffsevaluated is to the true optimum. And this techniqueis computationally intensive.

Minimum Estimated MSE Trimming could be usedto suggest optimal cutoffs at the state level, but asFigure 5 shows the estimated cutoff is still highlydependent on the sample data. In the end, determiningthe optimal cutoff for complex sample designs mayremain more art than science.

REFERENCES

Distribution of ·Optimall Cutoffsfrom Estimated Minimum MSE Trimming

to evaluate and how close that range is to the trueoptimum, similar to Winsorization at a cutoff.Figure 5

0.16

0.14

0.

0.

7000 11000 15000 19000'Optimai' CUtoffs

23000 ~ nottruncated

Ernst, L. R. (1980), "Comparison of Estimators of theMean Which Adjust for Large Observations,"Sankhya: The Indian Journal of Statistics, 42, 1-16.

Lee, H. (1991) "Outliers in Survey Sampling,"prepared for the Fourteenth Meeting of the AdvisoryCommittee on Statistical Methods.

Potter, F. (1988) "Survey of Procedures to ControlExtreme Sampling Weights," Proceedings of theSurvey Research Methods Section of the ASA.

Rumburg, S. (1990) "Characteristics of DirectlyExpanded Hog Data Outliers," NASS Staff Report,No. SRB-92-02, U.S. Department of Agriculture.

Searls, D. T. (1966) "An Estimator for a PopulationMean Which Reduces the Effect of Large TrueObservation," Journal of the American StatisticalAssociation, 61, 1200-1204.

Figure 5 shows the distribution of estimated"optimal" cutoffs in increments of 1000 when thisestimator is evaluated over the range 3000 to 24000.

Again, we see that the estimated "optimal" cutoff isdata-dependent.

XI. Recommendations

As has been proven theoretically by Ernst (1980),Winsorization at the optimal cutoff is preferable toWinsorization at an order statistic. We believe thisestimator holds promise for improving NASS statelevel hog indications. Further, while the data showsthat Winsorization at a cutoff performs well for a widerange of cutoffs, the best efficiencies are obtained forcutoffs greater than or equal to the optimum.

88

Page 93: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

AN ANALYSIS OF VARIANCE APPROACH TO DEFINING IMPUTATION CELLS FOR ACOMPLEX AGRICULTURAL SURVEY

John Amrhein, National Agricultural Statistics ServiceNASS/RD, 3251 Old Lee Hwy, Fairfax, VA 22030

The Model and the Test Statistic

•..where I' represents a solution to the nonnal equations,has a central chi-squared distribution with degrees offreedom equal to the rank of K' under the null

where y is a vector of values for a measured orobserved survey item; X is a known design matrix; {J'= (p, ai' ... aJ, where d is the number of effects, isa vector of coefficients of unknown value; and e is avector of residuals such that E(e)=O and V(e)= V.Assuming that the residuals are iid Nonnal randomvariables, thl~ standard analysis of variance can beconducted. Under these assumptions, the Waldstatistic:

(I)y=X/J+emodel

Complex survey designs are implemented toincrease the precision of estimates when there isknowledge about the underlying structure of thepopulation of interest or to facilitate data collection. Insuch cases the assumption of independent andidentically distributed (iid) observations underlyingconventional data analysis techniques, found in manycomputer analysis packages, is invalid. Ignoring thecomplex sampling design in favor of the iid assumptionduring data analysis results in biased estimation ofsampling variances. Therefore, an improved analysiscan be realize.d by accounting for the complex design.

Skinner et a1. discuss aggregated anddisaggregated approaches to modelling complex surveydata. The <lggregated approach models the surveyvariables of interest at the population level and accountsfor the survey design through adjustments to standardanalysis procedures under iid assumptions. Thedisaggregated approach includes the survey design inthe specification of the model. For example, columnsof binary vaJ;ables defining membership in strata orclusters can be included in the matrix of independentregressor vanables. A solution to the nonnal equationscan be obtained and linear combinations of thecoefficients can be used to test for significantdifferences of the cluster (or subpopulation) means.

Consider the fixed effects analysis of variance

Introduction

KEYWORDS: Analysis of Variance, Wald Statistic,Complex Survey, Imputation

Abstract

The use of models in the analysis of samplesurvey data continues to be an important area of study.Skinner et al. consolidate much of the work that hasbeen accomplished concerning this issue. This paperpresents an application of the analysis of variancemodel to complex survey data. Although the discussionfocuses on employing analysis of variance to defineimputation cells, the general method described isappropriate for many survey apphcations involving ananalysis of variance.

The first section of this paper discusses twobroad approaches to modelling complex survey data andintroduces a general linear model to be considered.The Wald statistic is presented as the conventional teststatistic for an analysis of variance. An adjustment tothe conventional technique that allows the application ofan analysis of variance to non-iid observations isdiscussed. The next section describes, through anapplication to a complex survey, how an analysis ofvariance model can be a useful tool to test theeffectiveness of imputation cells that are often usedwhen imputing for survey nonresponse.

Conventional analysis of variance techniques,such as F tests, that are used to detennine ifsubpopulation means are significantly different, rely onthe assumption that the observations are independentand identically distributed. It is often the case thatsurvey data, collected using a complex sampling design,violate this assumption. This paper demonstrates theuse of an analysis of variance model, adjusted for acomplex sampling design, as an effective tool to defineimputation cells when adjusting for nonresponse in acomplex agricultural survey. A solution to the nonnalequations is derived in the usual manner. However, aresampling technique is used to obtain a consistentestimate of the covariance matrix. This estimate is thenused to calculate a Wald statistic for conducting F testson testable hypotheses. Examples for several surveyitems are discussed.

89

Page 94: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

hypothesis Ho: K'/1=m, where K' is a contrast matrixwhose rows represent testable linear combinations ofthe coefficients. By correctly specifying K' and settingm=O, the significance of the effects in partitioning thevariance of the dependent variable can be tested usingconventional F tests. Koch et al. present an approachto adjusting conventional analysis of variance techniquesfor data from complex survey designs; that is, for casesin which the iid assumption is not valid. A furtherreview and example are presented by Freeman wherehe labels this method the KFF method. In theseexamples the method of weighted least squares is usedto estimate the vector /1 and a consistent estimator (suchas a resampling technique) is used to estimate thecovariance matrix for /1.

Similarly, Skinner et al. discuss the followingprocedure, which is followed in this study. Let n bethe diagonal matrix of inclusion probabilities for thesampled units and consider again the model in (1). Theweighted least squares estimator for the parametervector, /1, for a model of full rank is

~=(X'II-1X)-lx/n-ly' (3)

This is the product of the design unbiased, Horvitz-Thompson estimators for X'X and X'Y (Shah et al.).The estimator in (3) is model unbiased under the modelin (1) (Skinner et al.) and has true variance

l't~1Z)·Y~-(X/IrI~-IX/n-lvn-IX(r/II-I~-I. (4)

If the estimator for the covariance matrix given in (4)is consistent, then the techniques described for the iidcase can be used for tests of significance. In this case,the Wald statistic in (2) is approximately distributedchi-squared. Standard statistical software can be usedto conduct the necessary regression, and variousresampling techniques are available for consistentestimation of (4). Rao et al. present some recent workconcerning the use of the jackkni fe, balanced repeatedreplication and the bootstrap methods for inference withcomplex survey data. A discussion concerning whichmethod is most appropriate is beyond the scope of thispaper. This is one of several areas requiring furtherinvestigation.

Application to Mean Imputation

Kalton and Kasprzyk explain how most modelsunderlying imputation or reweighting adjustments fornonresponse fit the form of the general linear model in(1) where the design matrix X defines the inclusion ofan observation in a reweighting or imputation cell andmay also include auxiliary variables, and fJ is an effects

90

vector. One method of adjusting for nonresponseinvolves defining population cells in which nonresponseis assumed or known to be ignorable; that is, we wantto form cells in which the nonrespondents areconsidered to be a simple random sample of the originalsampled units in that cell and the within-cell variance issmall. By conducting an analysis of variance, thecontribution of a variable in defming homogeneousgroups can be measured by testing if its coefficient issignificantly different from zero.

The National Agricultural Statistics Service(NASS) of the United States Department of Agriculturecollects crop, livestock and grain stock inventory datathrough a series of sample surveys. NASS drawssamples from a list frame and an area frame andconducts concurrent surveys each June. NASS' areaframe consists of the land area of the United States.Each county within each state is stratified based on landuse and, in the event of agricultural land, percentcultivation. Substratification is performed based on thetype of agricultural activity (crop, livestock, etc.). Atwo (or sometimes three) stage sampling process selectssegments of land for enumeration in June. A segmentis a cluster of tracts. Tracts are defined as areas ofland within a segment under one operating arrangement.Each tract is associated with an operator so that it canbe matched against the list frame. Tracts from the areaframe sample that were found to be not eligible forsampling from the list frame, labeled non-overlap(NOL) tracts, are identified and used as a measure ofthe incompleteness of the list. The expanded(weighted) data from NOL tracts are combined withexpanded (weighted) data from the list sample to derivemultiple frame totals. NOL tracts are restratified basedon data collected in the June survey. Based on the newstrata, a stratified simple random subsample of the NOLtracts is drawn for each subsequent, or "follow-on·,survey in the twelve month survey cycle following theJune survey. As in June, the totals from the areasubsample are added to totals from the list sample forfull population multiple frame totals. Thus, the overallsample design for selecting NOL tracts for follow-onsurveys is a two phase stratified design.

NASS uses agricultural statistics districts(geographical delineations) within each state to defineimputation cells to adjust for item nonresponse in theNOL sample of the follow-on surveys. This study wasinitiated to detennine the appropriateness of these celldefinitions. Because of data abundance and the onephase design, June NOL observations were used in thisstudy rather than follow-on survey observations. Theone phase design is easier to mimic when resampling.

The model analyzed in this study is thefollowing:

Page 95: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

where: y;j=value of survey item for unit iji = 1, ... ,d agricultural statistics districts

where 1<d <9 for a given statej = 1, ... ,11; observationsIL = the population mean for the stateaj = the effect of the ilb ASDe;j= the residual for the ijlb observation

where E(e..) = 0 and V(e..) = q2.IJ 1J 1

Although the above model groups theobservations into subpopulations (districts), these groupsare aggregates of survey design clusters or strata.Therefore, the aggregated approach as described bySkinner et al. was adopted for this study. The objectivewas to determine if separating the responses intodistricts aids in partitioning the variance. Resultsindicating that districts significantly partition thevariance of reported values for the survey item (e.g.cropland acres, total number of hogs, etc.) wouldprovide support for the assumption that the cells arehomogeneous groups of iid observations and thatnonrespondents are a simple random sample within eachcell. It is not an objective to derive a predictive modelfor the dependent variable. Indeed, the analysis ofvariance models in this study are not full rank.Therefore, the solutions to the normal equations are notunique and cannot be used for prediction.

The bootstrap technique described by Rao etaI. was used, with 250 iterations, to estimate thevariance given in (4). The null hypothesis was Ho:K'.B=m, where K' = [Od.! 1d.! -1· ~I] and m=O.Index d is the number of districts in the state; i.e. forthe fixed effects, aj' i ranges from 1 to d. Defining K'in this manner tests that all aj's are equal or,effectually, that all aj's equal zero. For a goodexplanation of why this is so, the reader is referred toSearle.

The cell means were tested for significantdifferences by calculating a p-value which is defined asthe probability that a random variable that is distributedFcI'I,~(cI+')is greater than the observed value of the Waldstatistic divided by its degrees of freedom. That is:

p-value = Proh[ FcI•l• n-(d+ll > Qk/(d-1) ].

When the denominator degrees of freedom, n-(d+ 1), islow, using the F distribution as described above offersa refinement over comparing Qk to a X2d., value(Skinner et al. p. 79). For the hypothesis describedabove, a small p-value would indicate that the districtmeans are significantly different.

91

Results obtained in this study indicate thatdistricts aid in the separation of the variance of theagricultural survey items in this study. The analysiswas conducted using only positive data. The objectivewas to test if any difference existed between districtmeans for those farms that had the item of interest.Therefore, for example, when analyzing cropland acres,only those farms with reported cropland acres greaterthan zero were included. The four survey items thatwere tested are as follows, with the number of observedp-values that were less than . lover the number ofstates included in the analysis given in parentheses:cropland acres (36/48), number of hogs (13/30), on-farm grain storage capacity (29/36) and winter wheatharvested acres (6/15). States were excluded from agiven analysis if, as in the cases of Alaska and Hawaii,they are not included in the survey program, or therewere too few observations for the survey item. Forexample, Rhode Island has a small number of hogoperations and, therefore, was not included in theanalysis of total number of hogs. Also, NOL sampledunits are use.d as a measure of list incompleteness.Therefore, we are dealing with fewer observations thanif data from the list frame sample were used.

From these findings, it can be concluded thatdefining imputation cells based on agricultural statisticsdistricts partitions the population into morehomogeneous groups than if the cells were defined atthe state level. Cell means were found to besignificantly different in enough states that NASSshould not collapse imputation cells to the state level.It cannot, however, be concluded from this study thatdistricts partition the population into the mosthomogeneous groups. There may be other auxiliaryvariables available that partition the population intomore homogeneous groups.

Discussion

A natural question to ask is if anyimprovement in analysis was realized by accounting forthe sample design. Therefore, an analysis of variancewas also conducted ignoring the sample design andassuming iid observations. The number of states withan observed p-value of less than .1 over the number ofstates teste.d, given in parentheses, is as follows: forcropland acres (21/48), number of hogs (3/30), on-farmgrain storage capacity (14/36) and winter wheatharvested acres (8/15). The lower occurrence ofsignificance, at a .1 level, (except for winter wheatharvested acres) suggests that the stratification resultedin more precise estimates. The analysis of variance thataccounted for the stratification detected differences that

Page 96: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

the analysis assuming independent observations did not.The difference in results is most apparent with thenumber of hogs survey item. With only three of thirtystates showing significance, one would conclude thatdistricts do not aid in partitioning the variance. Theestimated design effects in these cases would be lessthan one.

In conclusion, the strategy outlined here forconducting an analysis of variance is an effective toolthat can be used to determine the effectiveness of theimputation cells rather than relying solely on expertopinion, as is often done.

References

Freeman, D. H. Jr. (1988), "Sample Survey Analysis:Analysis of Variance and Contingency Tables,"Handbook of Statistics 6, eds. P.R. Krishnaiah andC.R. Rao, New York: North Holland, 415-426.

Kalton, G. and Kasprzyk, D. (1986), "The Treatmentof Missing Data," Survey Methodologv, 12: 1, 1-16.

Koch, G. G., Freeman, D. H. Jr. and Freeman, J. L.(1975), "Strategies in the Multivariate Analysis of Datafrom Complex Surveys," Int. Stat. Rev., 43:1, 59-78.

Rao, J. N. K., Wu, C. F. J. and Yue, K. (1992),"Some Recent Work on Resampling Methods forComplex Surveys," Survey Methodology, 18:2, 209-217.

Searle, S. R. (1971), Linear Models, New York: JohnWiley and Sons, p. 240.

Shah, Babubhai V., Holt, M. M. and Folsom, R. E.(1977), "Inference About Regression Models FromSample Survey Data," Bulletin of the InternationalStatistical Institute, 41 :3, 43-57.

Skinner, C. J., Holt, D. and Smith, T. M. F. (1989),Analvsis of Complex Surveys, New York: John Wileyand Sons.

92

Page 97: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

A COMPARISON OF FOUR ALTERNATIVE WEIGHTED ESTIMATORS TO THEOPEN ESTIMATOR FOR USE IN THE AGRICULTURAL LABOR SURVEY

Cheryl L. Turner, USDA/OASS200 N. High, Room 608, Columbus, OH 43215

KEY WORDS generated from a larger group of farm operators.

June Agricultural Survey, Agricultural laborSurvey, non-overlap, open estimator, weightedestimator

INTRODUCTION

The National Agricultural Statistics Service(NASS) within the United States Department ofAgriculture (USDA) annually conducts a JuneAgricultural Survey (JAS). The JAS is a multipleframe survey, consisting of both a list frame andan area frame. The area frame is stratifiedaccording to land usage or the percent ofcultivation. The area frame is further subdividedinto overlap (Ol) and non-overlap (NOl) domains.The overlap portion of the area frame iscomposed of farming operations which are alsofound on the list frame. The non-overlapcontains those farming operations which are notfound on the list frame.

The JAS begins the survey year and is the largestsurvey of the year for NASS. Follow-on surveysamples are derived from a list sampling frameand a sample of the area frame. The AgriculturalLabor Survey (AlS) is a multiple frame follow-onsurvey. It provides estimates of the number offarm workers and of the wage rates paid to thosefarm workers. Currently, the non-overlapestimate for the AlS is derived using an openestimator. The ALS open estimator is based ona sample of NOL Resident Farm Operators(RFO's) from forty percent of the area segmentsused in the JAS. (A segment is a piece of landthat is the primary sampling unit in the NASSarea frame sampling plan.) By definition, theopen estimator excludes all non-Resident FarmOperators. An alternative to the open estimatoris a weighted estimator. The weighted estimatoris generated from a sample of ill NOL farmoperators, both RFO and non-RFO. The weightedestimator has historically had a smallercoefficient of variation (CV) than the openestimator because the weighted estimate is

Four weighted estimators were evaluated forpossible use in the ALS. They were theoperational, modified weighted (modified),Hanuschak-Keough strata mean (H-K mean), andthe Hanuschak-Keough strata median (H-Kmedian). Each weighted estimator wascompared against the current open estimator.

This report represents the comparative analysisdone on these alternative weighted estimators.All estimators used the • peak number of hiredworkers· from 1991 JAS data. The JAS areaQuestionnaire obtains the expected • peak numberof hired workers· for the survey year. Thisnumber is then used to define the NOL strata forthe foJlow-on ALS. This study was doneindependently on both the 17 labor regions andthe eleven monthly and seasonal states.

STUDY DESIGN

Data for this survey were collected during the1991 JAS and represent the NOL domain. Theitem of interest was the peak number of hiredagricultural workers for the survey year. Thedata were evaluated at the regional level and atthe state level (for the eleven monthly andseasonal states). There are 17 labor regionswithin the United States. They are defined asfollows:

Reaion

Northeast I:Connecticut, Maine, Massachusetts, NewHampshire, New York, Rhode Island, Vermont

Northeast II:Delaware, Maryland, New Jersey, Pennsylvania

Appalachian I:North Carolina, Virginia

Appalachian II:Kentucky, Tennessee, West Virginia

93

Page 98: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

Southeast:Alabama, Georgia, South Carolina

lake:Michigan, Minnesota, Wisconsin

Cornbelt I:Illinois, Indiana, Ohio

Cornbelt II:Iowa, Missouri

Delta:Arkansas, louisiana, Mississippi

Nonhern Plains:Kansas, Nebraska, North Dakota, South Dakota

Southern Plains:Oklahoma, Texas

Mountain I:Idaho, Montana, Wyoming

Mountain II:Colorado, Nevada, Utah

Mountain III:Arizona, New Mexico

Pacific:Oregon, Washington

Florida:• • Florida

California:• • California

•• Note that Florida and California are singlestate regions.

The monthly states are California, Florida, NewMexico, and Texas. Michigan, New York, NonhCarolina, Oregon, Pennsylvania, Washington, andWisconsin are the seasonal states.

THE WEIGHTED ESTIMATORS

Two types of estimators were being evaluated,an open and a weighted estimator. For an openestimator, the location of the operator'sresidence is used to uniquely associate everyfarm with only one segment. A weight of one isassigned if the tract operator lives within theselected segment (if the tract operator is anRFO), and a weight of zero is assigned otherwise.Conversely, the weighted estimator apponions afarm's activities to a segment by weighing thedata relative to the fraction of the farm's acreagethat lies within the segment boundary.Therefore, one farm may contribute to the data inseveral segments.

94

As stated earlier, the AlS open estimator isbased on a sample of NOl RFO's from fortypercent of the area segments used in the JAS.

In contrast, an AlS weiahted estimator would bebased on the same sample size being selectedfrom ill NOl operations (both RFO's and non-RFO's) from the same forty percent of areasegments. The respondents selected using anAlS weighted estimator would have beenselected from a larger pool of potentialrespondents. In sampling from the larger pool ofrespondents, there is the potential for a reductionin the CV.

The operational, H-K mean, H-K median, and themodified weighted were the four weightedestimators being evaluated.

Ooerational

The operational weighted estimator is theweighted estimator traditionally used in NASSsurveys. It merely assigns an ·operational-weight of tract acres divided by total farm acresfor each farming operation even partiallycontained within the segment. (Where the tractacres are the acres residing within a sampledsegment.) This estimator prorates farm leveldata to the segment level.

Hanuschak-Keouah strata mean and median

These two weighted estimators are similar to theoperational weighted estimator, but they attemptto limit potential outliers by controlling the valueof the weight. There are occasions when theexact farm acreage is neither obtainable norknown. This happens when the respondenteither would not or could not give the correctfarm acreage. In these instances the tractacreage and farm acreage may be recorded asequal (plus perhaps a token acre for thefarmstead) on the JAS. Although this problemhas been recognized and emphasized at trainingschools, it still exists (but to a lesser degree).Hanuschak and Keough proposed a solution forthis specific type of problem. In some cases theequality of the tract and farm acres is accurate.However, if the farm acres should have beensubstantially larger than the tract acres, the·operational· weight would be nearly or equal toone when it should have been considerablylower. This problem leads to a great

Page 99: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

overexpansion of the survey data. Andconversely, there could be an underexpansion ofthe survey data if tract acres were underreported.

Hanuschak and Keough recommended a morerobust estimator than the standard ·operational·weight. A robust estimator is relativelyinsensitive to slight departures from theassumptions of normality. The Hanuschak-Keough estimators replaced the • operational·weight with a robust weight for all NOL tracts (orobservations) in which someone other than theoperator or the operator's spouse responded.The Hanuschak-Keough estimators will guardagainst large overexpansions or underexpansionsof the survey data. Consider the followingrespondent codes as defined in the JAS survey:

ResDondent Code1 = Operator/Manager2 = Spouse3 = Other4 = Observed Refusal5 = Observed Non-refusal

The Hanuschak-Keough estimators replaced the·operational· weight for all NOL observationscontaining respondent code 3, 4, or 5 with amore robust weight. Within each land use strata,the Hanuschak-Keouah strata mean estimatorreplaced the denominator of the ·operational·weight for those observations containingrespondent codes 3, 4, or 5 with the averagefarm acreage from the respondent code 1 and 2observations. The Hanuschak-Keouah stratamedian estimator replaced the denominatorwithin each land use strata for those sameobservations with the median farm acreage fromthe respondent code 1 and 2 observations.

Modified Weiahted

The modified estimator was originally proposedby Bosecker and Clark. It is an effort to eliminatescreening for farm operators in densely populatedsegments. In reducing the amount of surveyscreening, the cost of conducting the survey isgreatly reduced.

The modified estimator is especially suited to themeasurement of rare populations, and thenumber of farm operators among the general

95

population (particularly in residential areas)certainly Qualifies as rare. The modified weightedestimator will exclude up to one half acre fornon-agricultural land devoted to residentialpurposes (such as the house and yard). Forresidential agricultural tracts, the residential areawould be subtracted from the weight's numeratorand denominator; for non-resident agriculturaltracts, the residential area would be subtractedjust from the weight's denominator. Since themodified weight would be zero for small tractsconsisting of only a house and yard, screeningfor farm operators in residential areas would beunnecessary. The modified weight assumed 1/2acre for all residences, except where it wasknown that the farmstead was less than 1/2acre.

The expanded peak number of hired workers wascalculated using the open estimator and each ofthe alternative weighted estimators.

ANALYSIS

NOL estimates were generated for the peaknumber of hired workers. Both the open and theweighted estimators were generated using thesame number of tracts and the same tractinformation. Identical analyses were used toindependently compare each of the fouralternative estimates with the current openestimate of the peak number of hired workers.Univariate paired Hests were conducted at theregional level for the 17 regions and at the statelevel for the eleven monthly and seasonal stateson each alternative estimator versus the openestimator. These Nests will determine if thealternative estimate was significantly differentfrom the open estimate. The paired t-test willtest the following hypotheses for each alternativeestimate:

Ho: Yclift = 0 versus H,,: Yclift < > 0

where Yclift =: alternative estimate - open estimate

RESULTS

Univariate paired t-tests were performed on thevariable peak number of hired workers. The tstatistics were calculated for both the 17 labor

Page 100: SURVEY METHODS FOR Agriculture FARMS, AND INSTITUTIONS ..._Presen… · data. The same method was used to calculate county estimates of rice. cotton and soybeans in the Mississippi

regions and the eleven monthly and seasonalstates for each of the four weighted estimatesversus the open estimate.

labor Reaion Results

The test results indicated that most of thecomparisons yielded insignificant differences(alpha = .05) at the regional level. Therefore,there were negligible differences between each ofthe four alternative estimators and the openestimator for these regions.

The test results also indicated that somesignificant differences (alpha = .05) did exist atthe regional level. Significant differencesbetween each of the four alternative estimatesand the open estimate existed in the Delta regionand the Southern Plains region. In theAppalachian II region and the Southeast region,significant differences existed for all comparisonsbut the H-K mean estimate and the openestimate. Significant differences existed in thePacific region between each the operational andmodified estimates and the open estimate. Andlastly, the Northern Plains and California regionsobtained significant differences between the H-Kmedian estimate and the open estimate.

As stated above, both the Delta and SouthernPlains regions obtained significantly differentresults for the four alternative estimators ascompared to the open estimate. Furtherexamination of these two regions shows thatArkansas, louisiana, and Texas were thedominating states within their respective regions.All states were significantly different with respectto the alternative estimate vs. the open estimate.When Arkansas, louisiana, and Texas wereevaluated individually, one tract often accountedfor the majority of difference between thealternative estimates and the open estimate.

For example, within Texas there was one tractwhich made no contribution to the peak numberof hired workers for the open estimate. But foreach of the four alternative weighted estimates,this tract alone contributed between four andeight percent of Texas' state level expansion forthe peak number of hired workers. Thedifferences in these estimates were due in part tothe farmer living outside of the selected segment(and therefore having an open weight of 0), whileat the same time having a positive number of

96

hired workers.

In following with previous findings, the openestimate was the lowest estimate (due to adownward bias) in 12 of the 17 regions, whilethe H-K median was the highest estimate in 11 ofthe 17 regions. The operational, H-K mean, andmodified estimates were most often foundbetween these two extremes.

The CV for the open estimator was the largestCV in 13 of the 17 regions. This supports thenotion that sampling from a smaller sample size(only the RFO's) will increase the CV. The CV'sfor the four weighted estimators were (overall)considerably smaller than those for the openestimator, but none of the alternativesdistinguished itself as having the lowest CV.

State level Results

Mostly insignificant differences (alpha = .05)also existed at the state level. And as with theregional level results, this indicated that therewere negligible differences between each of thefour alternative estimators and the openestimator for the monthly and seasonal states.

The test results at the state level also indicatedthat some significant differences (alpha = .05)did exist. Significant differences between all fourof the alternative estimates and the openestimate existed only in Texas. There weresignificant differences in Washington betweenthe operational estimate and the open estimateand also between the modified weighted estimateand the open estimate. A significant differencealso existed between the H-K median estimateand the open estimate in California.

Also, as with the regional results, the estimateswere lowest for the open estimator in 7 of the 11states and the estimates were highest for the H-Kmedian estimator in 8 of the 11 states. Theoperational, H-K mean, and modified estimatorswere barely distinguishable from each other, eachlying between the two extremes. And, again theopen estimator CV as the largest CV in 7 of the11 states. The four weighted estimator CV'sagain obtained smaller CV's than the open CV,while not substantially differing from oneanother.