Available online at www.sciencedirect.com
Computers, Environment and Urban Systems 32 (2008) 214–232
www.elsevier.com/locate/compenvurbsys

A comparison of address point, parcel and street geocoding techniques

Paul A. Zandbergen *
Department of Geography, Bandelier West Room 111, MSC01 1110, 1 University of New Mexico, Albuquerque, NM 87131, USA
Received 11 January 2007; received in revised form 15 November 2007; accepted 19 November 2007

Abstract

The widespread availability of powerful geocoding tools in commercial GIS software and the interest in spatial analysis at the individual level have made address geocoding a widely employed technique in many different fields. The most commonly used approach to geocoding employs a street network data model, in which addresses are placed along a street segment based on a linear interpolation of the location of the street number within an address range. Several alternatives have emerged, including the use of address points and parcels, but these have not received widespread attention in the literature. This paper reviews the foundation of geocoding and presents a framework for evaluating geocoding quality based on completeness, positional accuracy and repeatability. Geocoding quality was compared using three address data models: address points, parcels and street networks. The empirical evaluation employed a variety of different address databases for three different Counties in Florida. Results indicate that address point geocoding produces geocoding match rates similar to those observed for street network geocoding. Parcel geocoding generally produces much lower match rates, in particular for commercial and multi-family residential addresses. Variability in geocoding match rates between address databases and between geographic areas is substantial, reinforcing the need to strengthen the development of standards for address reference data and improved address data entry validation procedures.

© 2007 Elsevier Ltd.
All rights reserved.

Keywords: Geocoding; Reference data; Address models; Address points; Parcels

1. Introduction

Addresses are one of the fundamental means by which people conceptualize location in the modern world. In a Geographic Information System (GIS) addresses are converted to features on a map through the geocoding process. Much literature has been written on the topic of geocoding and the underlying algorithms that make it function. Questions such as, "What is an acceptable match rate?" (Ratcliffe, 2004) and "How do different algorithms affect the geocoding result?" (Karimi & Durcik, 2004) are readily encountered in the literature. Whatever the application, the primary concern generally relates to the accuracy of a geocoding technique. National datasets and local datasets face different challenges. While large national datasets must contend with a diversity of address formats, requiring more complex rules that define how an address is broken down for geocoding, local datasets require better positional accuracy since the geocoded data is often analyzed in relatively small geographic units. With the increasing power of GIS there has also been an increase in commercial vendors that provide custom geocoding tools and reference data (e.g. NavTech and TeleAtlas). Additionally, web-based address look-up engines such as Google Maps, MapQuest, and Yahoo Maps have become mainstream tools among the general public, making even greater the demand for accurate geocoding.

The general purpose of this paper is fourfold: (1) to review the foundations of the geocoding process; (2) to review address data models used in geocoding; (3) to present a framework for evaluating geocoding quality; and (4) to present the results of an empirical comparison of geocoding match rates using different address data models.

* Tel.: +1 505 277 3105; fax: +1 505 277 3614.
E-mail address: zandberg@unm.edu
0198-9715/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compenvurbsys.2007.11.006
P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232

The review portion of this paper complements recent reviews of geocoding by Rushton et al. (2006) and Goldberg, Wilson, and Knoblock (2007); this paper focuses more on address models and issues of geocoding quality. The empirical component in particular will examine the influence of geocoding techniques relative to the influence of variability in input address quality on geocoding match rates.

2. Geocoding foundations

While simple in concept, geocoding as a process is not as simple as just putting a dot on a map. Techniques involved in geocoding borrow from various disciplines, most notably, information theory, decision theory, probability theory, and phonetics. What follows is a brief review of the fundamental concepts of the geocoding process.

2.1. Geocoding process

Geocoding is the process of assigning an XY coordinate pair to the description of a place by comparing the descriptive location-specific elements to those in reference data. The geocoding process is defined as the steps involved in translating an address entry, searching for the address in the reference data, and delivering the best candidate or candidates as a point feature on the map. Generally, these steps include parsing the input address into address components (such as street name, street type, etc.), standardizing abbreviated values, assigning each address element to a category known as a match key, indexing the needed categories, searching the reference data, assigning a score to each potential candidate, filtering the list of candidates based on the minimum match score, and delivering the best match.

While geocoding applications are diverse and span many types of applications, there are several common problems associated with geocoding that have traditionally caused poor match rates, requiring excessive manual mapping by the user and potential inaccuracies and/or incompleteness in the resulting spatial datasets.

2.2. Probabilistic record linkage

Probabilistic record linkage is the process of matching two data files under conditions of uncertainty. The objective is to identify and link records which represent a common entity whether the entity is an individual, a family, an event, a business, an institution, or an address. Probabilistic record linkage systems use a form of fuzzy logic to score how well records do or do not match. The concept is in contrast to deterministic record linkage which assumes error-free identifying fields and links records that match exactly on these identifying fields. For example, the ability to join database records on matching primary and foreign keys is an example of deterministic linkage. When no error-free identifier is shared by all of the data sources, a probabilistic record linkage technique can be used to join data sources (Gu, Baxter, Vickers, & Rainsford, 2003).

Within this probabilistic system, each field participating in the linkage comparison is subject to error which is measured by the probability that the field agrees versus the probability of chance agreement of its values. The assignment of such probabilities is intended to mimic a human decision making process.

The general information flow in a probabilistic record linkage system can be grouped into seven main categories (Gu et al., 2003): data, standardization, searching/blocking, selection of attributes for matching/comparison, weights, the decision model, and performance measurement. Data includes datasets from different sources that need to be linked. Standardization is used next to replace spelling variations of commonly occurring words with a standard spelling. Without standardization many true matches could be wrongly designated as non-matches because the common identifying attributes do not have sufficient similarity. After standardization, searching/blocking is used to reduce the number of comparisons of record pairs by bringing only the linkable pairs together. A good attribute variable for blocking should contain a large number of attribute values that are fairly uniformly distributed. Such an attribute must have a low probability of reporting errors. The ideal blocking component would be one which nearly always agrees in "true match" record pairs but nearly always disagrees between pairs which are not valid matches (Jaro, 1984). Due to their key role in defining location, street names are generally used as the blocking mechanism in address geocoding tools. Soundex, described in a following section, is then used to create an index on the street name attribute to reduce the possibility of a false mismatch due to misspellings. Therefore, if the street name (or rather, the Soundex index value generated from the street name) is not found, no possible matches are suggested. A common geocoding error is the incorrect standardization of an address like "1300 North Star Rd" to "1300 N Star Rd". Due to pre-defined look-up tables that define "North" as a directional prefix, through the standardization process "North" is stripped from the street name and a Soundex code is generated for "Star" rather than "North Star", and consequently a match is not found, resulting in a false negative result. The next step involves the selection of attributes for matching/comparison. Common attributes should be selected for use in the comparison function. Major components are generally predefined in addresses, such as house number, prefix direction, street name, and street type. However, there are significant regional variations in many parts of the United States and the world (which might incorporate additional components, such as zone, street suffix, prefecture, etc.) which require the use of custom locator styles for differing datasets. A comparison vector (a weight) is then used for each pair based on assigned weights. The discriminating power of a component (such as an address element) is a measure of how useful that component is in predicting a match. If the components are assumed to be statistically independent, then the composite weight is equal to the sum of the individual component weights (Jaro, 1984). For each record pair, a decision is then made whether to classify the pair as a match (M), a non-match (U), or tie (T) which must be followed up interactively by the user to manually specify the correct match.

2.3. Standardization

Standardization is a part of the probabilistic record linkage process. However, it is such a specialized component in GIS that it deserves individual attention. The most common approach for name and address standardization is the manual specification of parsing and transformation rules. An input string is first parsed into individual words. Each word is then mapped to a token of a particular class (Churches, Christen, Lim, & Zhu, 2002). The choice of class is determined by the presence of that word in user-supplied, class specific lexicons, or by the type of characters found in the word such as all numeric, alphanumeric or alphabetical. The process described above is a deterministic approach, meaning a one-to-one match must be found for certain address components (such as house number or street type). Churches et al. (2002) present a probabilistic method for standardization using Hidden Markov Models (HMMs) as an alternative. At present few practical implementations of HMMs have emerged for geocoding with the notable exception of the Geocoded National Address File (G-NAF) for Australia (Christen, Churches, & Willmore, 2004).

2.4. Soundex

Soundex is a way of indexing information based on how the word sounds rather than how it is spelled. It is a phonetic indexing system, blocking together many of the common types of spelling errors and abbreviations. Most versions of Soundex convert a text string into a code consisting of the first (leftmost) letter of the string, followed by 3 or more digits (Patman & Shaefer, 2001). The method is based on the phonetic classifications of human speech sounds, which in turn are based on where you put your lips and tongue to make the sounds. The key concept behind Soundex relies on the assumption that a constant relationship between letters and sounds should assure that similar-sounding names are assigned the same code. Soundex also functions as a compression scheme since the code contains one half to two thirds the information contents of the full name (Winkler, 1999). Within the geocoding process a Soundex index is commonly applied for the street name component of the standardized address.

Some of the limitations of Soundex that must be taken into account include (Patman & Shaefer, 2001): sensitivity to spelling variations, the algorithm's dependence on the initial letter, noise intolerance (mistyping, extra consonants, swapped consonants), differing transcription systems, names containing particles, perceptual differences, silent consonants, and the use of initials, among other potential errors. Testing of Soundex has shown it to produce a high number of incorrect matches (Stanier, 1990), and improvements have been suggested (Christian, 1998). Despite its limitations, Soundex is currently implemented in most geocoding software, but other types of probabilistic record linkage that do not rely on Soundex have been developed (Christen, Churches, & Zhu, 2002).

3. Address data modeling

One of the main challenges to accurate geocoding is the availability of good reference data. This includes a set of geographic features that are needed to match against as well as robust address characteristics (attribute data) that enable matching address records to feature locations in a GIS. This requires a sturdy address model to organize the reference data components in a logical, maintainable and site-specific way.

There are many challenges to building good reference data (Arctur & Zeiler, 2004). Addresses can be associated with many kinds of feature classes in a reference database; for example, road centerlines, parcel boundaries, address points, building structures, etc. The complexities of address component relationships might also dictate that some address elements be organized in separate, related tables since addresses and features in the GIS can share complex relationships (such as many-to-many). A feature might also have sub-addresses. For example, a parcel may house a duplex with two separate addresses. Sets of address components can also vary by locale and culture.

Several common address models exist. Each has a particular set of supporting materials and characteristic errors. The first one can be characterized as the "geographic unit" model. These geographic units can consist of postal codes (such as ZIP codes in the United States), Counties, cities, census enumeration areas or any other geographic boundary considered meaningful. In the geocoding process the location assigned to a particular address is the polygon (or the polygon centroid) representing the geographic unit. Location within the unit is not specified, but analyses can be carried out using data associated with the geographic unit. Postal codes are particularly attractive since this type of information is much easier to obtain than individual street address information and postal code data also tends to be very complete and accurate. For example, most people know their postal code and are less likely to provide misspellings or alternative descriptors than for street addresses. The utility of the results is obviously related to the size of the geographic units. For example, in the United States 5-digit ZIP codes tend to be quite large, typically larger than census tracts, making them less attractive when spatially detailed information is required. In several other jurisdictions the postal code system is much finer grained and can provide a fairly accurate location. For example, Canada uses a 6-character postal code. The Postal Code Conversion File developed by Statistics Canada and Canada Post contains the geographic coordinates of each postal code. In major urban areas a single 6-character postal code typically corresponds to a single block-face (Statistics Canada, 2002). An empirical validation study by Bow et al. (2004) determined that for a sample of addresses in the City of Calgary 87.9% of postal code locations were within 200 m of the true address location and 96.5% were within 500 m using straight-line distance.

For many applications that do not require individual-level locations, geocoding at the postal code level might be very appropriate, in particular since match rates at this level are typically very high. When geocoding at the level of postal codes is not sufficient, several alternatives exist, including street networks, parcel boundaries and address points. Each of these three address models will be described in more detail below.

3.1. Model 1: Street network data model

The most widely employed address data model is based on street network data. In this approach a street network is represented as street line segments that hold street names and the range of house numbers and block numbers on each side of the street. Address geocoding is accomplished by first matching the street name, then the segment that contains the house numbers and finally placing a point along the segment based on a linear interpolation within the range of house numbers. An optional offset can be employed to show on which side of the street line segment the address is located. This approach to geocoding an address is referred to as "street geocoding" and has become the most widely used form of geocoding. Nearly all commercial firms providing geocoding services and most GIS software with geocoding capabilities rely primarily on street geocoding.

The street network address model facilitates storing different names and address ranges for different sides of the street and enables validation of cases where there is no address range for one side of the street. It also supports cases where streets have multiple address ranges and names. Some additional attribute characteristics include the use of full block address ranges for major roads, while true address ranges are commonly used for residential roads. For better interpolation results, it is generally preferred to geocode with as much block-face accuracy as possible (that is, against true address ranges). While this results in a better spatial location for known valid addresses, this can also be problematic. When approximated addresses are geocoded against the centerline the records fail to match since the value does not fit into the existing range. An example of this situation would be an address like "300 [block] E Main St" when the known address range may run only from 315 to 345. Thus, some padding may be required even for true address ranges, and other means are needed for geocoding approximated data. Mixed parity issues also exist for some roads, which throw off interpolation techniques. Street name alias fields may exist in the attribute table since naming standards can vary. This also allows for some flexibility in modifying street names to better fit geocoding rule base expectations.

3.2. Model 2: Parcel boundaries data model

Parcel boundaries are traditionally the most spatially accurate data with address information available. Geocoding against parcels allows for matching against individual plots of land (or the centroids of those polygons) rather than interpolating against a street centerline. This is particularly useful in areas where parcels are not regularly addressed (such as on roads with mixed parity) or those parcels that may be quite a distance from the centerline.

A principal difference between parcel and street geocoding is that a single parcel usually has a single house number, while a single street segment has an address range. This implies that a match is only obtained in parcel geocoding if there is a perfect match for the house number; for street geocoding a match is obtained if the house number being matched falls within the address range for the street segment. In effect street geocoding does not provide for a check if the house number actually exists and can therefore more easily result in false positives (i.e. produce a match for a non-existing address location). This is one of the reasons why parcel geocoding typically results in a lower match rate. Another perhaps more important reason why parcel geocoding produces lower match rates than street geocoding is that a single parcel can be associated with many addresses; for example, duplex units, condominiums, apartment complexes, commercial sites, etc. While the parcel may have an address, the addresses of individual structures or units on the parcel are not always captured in the parcel database.

While geocoding using parcels is more spatially accurate than geocoding using streets, parcel data may not necessarily constitute all valid addresses within an area. Additionally, not all parcels have a true address. Some may have an abstract number or a non-standard reference listed in the address fields.

Despite the often lower match rates, parcel geocoding is generally considered more spatially accurate and is now becoming widespread given the development of parcel level databases by many cities and Counties in the United States (Rushton et al., 2006).

3.3. Model 3: Address point address data model

To overcome the limitations of parcels for geocoding, address points have emerged as a third address data model. The address point data model is often derived from a master address file (MAF) of all known addresses, which is frequently available in the form of an E911 address list compiled for emergency response purposes. Address point data can also be constructed from several existing data layers such as parcel data. Address points are created from parcel centroids for all occupied parcels (or points can be placed elsewhere within the parcel, such as the location of the main structure or in front of the main structure). This is supplemented with address points for sub-addresses such as individual apartment units, condominium units, duplexes etc. which are not recorded as separate properties in the parcel data. Field data collection or verification of building locations using digital aerial imagery can be used to further supplement the address point file.

Both Australia and the United Kingdom have developed national address point databases. In Australia, this database is part of the Geocoded National Address File. In the United Kingdom, this database has been developed by the Ordnance Survey and is referred to as the ADDRESS-POINT dataset. These two efforts have set the stage for other jurisdictions to develop similarly detailed and comprehensive address point databases. At present, however, there has been limited published research on the quality of the geocoding based on either G-NAF or ADDRESS-POINT.

In the United States address point geocoding at present is not in very widespread use. However, many local governments have started to create address point databases and several commercial geocoding firms have started to provide address point geocoding for selected urban areas.

4. Geocoding quality

For the results of geocoding to be meaningful, the geocoding process needs to meet certain quality expectations. Despite the widespread use of geocoding in a range of disciplines, the errors of geocoding have not received widespread attention in the literature. Much research that uses geocoding as one of its methods does not include any mention of the quality of the geocoding; if a reference is made to the quality, usually only the match rate or geocoding completeness is mentioned. Commercial geocoding firms also commonly emphasize high match rates to describe and promote their services, with little attention to other aspects of geocoding quality. Recent research is suggesting the emphasis on match rates is somewhat misplaced and potentially misleading (Whitsel et al., 2004).

The overall quality of any geocoding result can be characterized by the following components: completeness, positional accuracy and repeatability. Completeness is the percentage of records that can reliably be geocoded, also referred to as the match rate. Positional accuracy indicates how close each geocoded point is to the "true" location of the address. Repeatability indicates how sensitive the geocoding results are to variations in the street network input, the matching algorithms of the geocoding software, and the skills and interpretation of the analyst. Geocoding results of high quality are complete, spatially accurate and repeatable.

Several studies have been published that seek to investigate and evaluate the effectiveness of geocoding techniques and the quality of the final result. Most prevalent are those associated with health database mapping due to the dramatic increase in the number of public health applications using geocoding to assess geographical distributions of health-related issues such as zones of exposure and rates of disease. Three major categories of geocoding error can be identified: (1) data input errors, (2) reference data errors, and (3) errors related to the underlying geocoding process. These errors can be broken down into more precise categories for analysis in order to facilitate problem solving. Traditionally difficult addresses include apartment units, commercial suites, shopping center suites not addressed to the street centerline, and other troublesome address data anomalies.

4.1. Match rates

The simplest measure of geocoding quality is the match rate, or the percentage of records that produce a reliable match. An obvious question that emerges is: What is an acceptable match rate? Surprisingly, this question has received limited attention in the literature. In one of the few studies on the subject, Ratcliffe (2004) employed Monte Carlo simulation of geocoded crime incidents aggregated at the census block level to determine what minimum match rate is needed to obtain a reliable pattern of crime incidents. Results indicated that to generate a statistically reliable pattern a match rate of 85% was necessary. In general, however, match rates reported by studies that have employed geocoding vary greatly since they depend on many factors. There is no consensus on a universal standard for an acceptable geocoding match rate.

The match rate increases if efforts are made to increase the quality of the address file and the geographic reference file. Interpreting match rates, however, is very subjective since much depends on the criteria used to characterize a "match". For example, lowering the minimum match score will increase the overall match rate, but may inadvertently introduce false positives. For a given real-world set of addresses, there is thus a trade-off: increasing the match rate by lowering the minimum match score results in a decrease in accuracy and therefore geocoding quality.

4.2. Positional accuracy

Several studies have determined quantitative estimates of the positional accuracy of geocoding. Estimates of 'typical' positional errors for residential addresses range from 25 to 168 m (Bonner et al., 2003; Cayo & Talbot, 2003; Dearwent, Jacobs, & Halbert, 2001; Karimi & Durcik, 2004; Ratcliffe, 2001; Schootman et al., 2007; Strickland, Siffel, Gardner, Berzen, & Correa, 2007; Ward et al., 2005; Whitsel et al., 2006; Zandbergen, 2007; Zhan, Brender, De Lima, Suarez, & Langlois, 2006; Zimmerman, Fang, Mazumdar, & Rushton, 2007) based on median values of the error distribution. Results in urban areas are generally more accurate than in rural areas (Bonner et al., 2003; Cayo & Talbot, 2003; Ward et al., 2005). It should also be noted that the occurrence of major positional errors is relatively common. For example, in one of the more thorough studies by Cayo and Talbot (2003) 10% of a sample of urban addresses geocoded with errors larger than approximately 96 m and 5% geocoded with errors larger than 152 m. For rural addresses these distances were 1.5 and 2.9 km, respectively.

The positional error in geocoded addresses may adversely affect spatial analytic methods. Specific effects include inflation of standard errors of parameter estimates and a reduction in power to detect such spatial features as clusters and trends (Jacquez & Waller, 2000; Waller, 1996; Zimmerman, 2007). Even relatively small positional errors can have an impact on local statistics for detecting clusters (Burra, Jerrett, Burnett, & Anderson, 2002). Research on this topic has been mostly confined to the health field. For example, typical street geocoding is not sufficiently accurate for the analysis of exposure to traffic-related air pollution of children at short distances of 250–500 m (Zandbergen, 2007; Zandbergen & Green, 2007). Similar errors in misclassification of exposure potential have been identified by Whitsel et al. (2006).

4.3. Repeatability

The repeatability of geocoding has not received as much attention as positional accuracy. In one recent study by Whitsel et al. (2006) using a large sample (n = 3615) of addresses in 49 U.S. states, substantial differences were found between four commercial vendors. There were important differences among vendors in address match rate (30–90%), concordance between established and vendor-assigned census tracts (85–98%), and distance between established and vendor-assigned coordinates (mean of 228–1809 m). This confirmed earlier findings by Whitsel et al. (2004) for a much smaller sample that the repeatability of commercial geocoding is not very good. The exact causes for the lack of repeatability are unknown, since the geocoding algorithms and data quality procedures of commercial vendors are not disclosed.

In a comparison of three geocoding algorithms (LocMatch, ArcView 3.2 and Tele Atlas North America) using the same TIGER reference data, Karimi and Durcik (2004) found that the differences between the results were not significant. This suggests that differences in reference data are at least in part responsible for the observed differences between commercial vendors.

4.4. Study objective

Several different address models for geocoding have emerged, but very limited research has been carried out to determine their relative strengths and weaknesses. The objective of the empirical component of this study, therefore, is to compare the reliability of the address point, parcel and street network data models for geocoding. This comparison is accomplished by geocoding the same address databases using the three different address data models for the same geographic areas. To strengthen the comparison several different types of address databases from three different jurisdictions are used. The comparison emphasizes geocoding completeness (i.e. match rates) since positional accuracy is inherently tied to the type of address data model used (i.e. address point and parcel geocoding produce more spatially accurate results than street geocoding).

5. Methods

5.1. Study area

Reliable and complete reference information for address point, parcel and street geocoding is not available for all areas. As a result, this study employed an extensive search strategy to identify Counties with this type of reliable reference data in GIS compatible format. The search was limited to the State of Florida; most Counties in Florida have undertaken major investments in GIS data over the last two decades and access to address data of various types is also generally good in part due to the requirements of the Sunshine Law (Florida Statutes Chapter 286) to make public records available.

For each of Florida's 67 Counties, GIS Departments and Property Appraiser's Offices were contacted with a request for digital copies of address point, parcel and street centerline data in GIS format. A few Counties remained unresponsive to repeated requests, and several Counties do not maintain a GIS database, but ultimately digital data was obtained from 62 of the 67 Counties.

Street centerline data was available for all 62 Counties, and in most cases contained the proper fields required for geocoding. Parcel data was also available for all 62 Counties, but did not always contain the proper fields. The first priority in maintaining parcel data is not for geocoding, and therefore the completeness of the data is not always sufficient for geocoding. Sometimes address information is completely lacking and only legal descriptions are provided. Sometimes the address information is stored in a single field, making the creation of an address locator complicated. Despite these limitations, data from 35 Counties was deemed sufficient for geocoding. Development of address point data has not received the same level of effort as street centerlines and parcels, and was available for only 11 Counties. Since geocoding is often one of the main objectives in developing an address point database, their quality for this purpose is generally good and all 11 databases were considered adequate.

Upon review of the three databases for each County, only seven Counties were identified as having a reliable database for all three types. Of these seven, the three Counties with the largest population were selected based on sample size considerations: Bay, Collier and Seminole County. The location of these three Counties is shown in Fig. 1. Based on this selection process, the databases for the three Counties are by no means truly representative of what a typical GIS database at the County level looks like. Instead, they represent examples of the very best data available in terms of completeness, currency and appropriateness for geocoding. For the purpose of this study they represent a case-study of a best-case scenario which is not currently feasible at the State or national level in the United States, but represents an objective to strive towards. The case-study will illustrate the current performance of this best-case scenario.

Fig. 1. Location of Counties used in this study.

5.2. Addresses for geocoding

Six different databases were obtained for use in the comparison of geocoding using three different address models. The selection of these databases was driven by a number of considerations. First, the database had to be publicly available to facilitate data access. Second, the database had to be recently updated (2005 or 2006) to prevent temporal bias. Third, the database had to be available for the entire

ance Corporation (FDIC) in March 2006 (n = 5,138). Banks were selected as an example of commercial properties with a very good address description due to their strict licensing. It was anticipated that the address information from the FDIC would be very complete and standardized. Addresses for all registered child care facilities in Florida were obtained from the Florida Department of Children and Families (FDCF) in March 2006 (n = 13,564). The child care facilities database was selected for this study since they include both commercial and residential properties, i.e. licensed home child cares with a maximum of 10 children are part of this database. It was expected that this database would not be as complete or standardized as some of the other databases, since the licensing of child care is handled mostly by local authorities, which may vary across the State.
State of Florida to allow for comparisons among the three Addresses for all properties with a licensed elevator were Counties. Fourth, sufficient sample size for each County obtained from the Florida Department of Business and was needed. And fifth, a range of different types of Professional Regulation (FDBPR), Bureau of Elevator addresses was needed, including residential, commercial Safety (n = 45,998). The elevator database was selected and other types. The following six databases were decided for this study since these properties contain both commer- upon: commercial banks, child care facilities, properties cial and residential multi-family units which are known to with elevators, establishments with food permits, saltwater be a challenge in geocoding. Since elevators are regulated recreational fishing license holders, and registered sex and inspected by the State of Florida, a high degree of offenders. Each will be briefly described below. address standardization and completeness was expected. Addresses for all branches of licensed commercial banks Addresses for all saltwater recreational fishing license in Florida were obtained from the Federal Deposit Insur- holders were obtained from the Florida Fish and Wildlife P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 221 Conservation Commission (FFWCC) in December 2005 Table 1 (n = 744,149). The fishing licenses were selected for this Sample size of address databases used for geocoding study as an example of a large database of mostly residen- Database Bay Collier Seminole tial addresses. It was also anticipated that the addresses in na Blanks na Blanks na Blanks this database would not be very complete or standardized Commercial banks 57 – 127 – 119 – since the address information provided by the applicant Child care – 54 – 104 – 124 – is completely self-reported, with very little data validation commercial or checking. 
Child care – 23 20 48 58 82 55 residential Addresses for all food establishment in Florida were Elevators – 316 – 1181 – 787 – obtained from the Florida Department of Agriculture commercial and Consumer Services (FDACS) in March 2006 Elevators – 251 – 1439 – 42 – (n = 40,780). Food establishments contain all those facili- residential ties were food items are processed and sold; the majority Fishing licenses 10,336 1108 11,116 1132 9815 1047 Grocery stores 452 – 691 – 891 – consists of supermarkets, grocery stores and convenience Sex offenders 289 – 189 – 306 – stores. Food establishments were selected for this study a Sample size after removal of records with blank addresses. as an example of commercial properties with a large sample size. Addresses for all sex offenders registered in Florida were prefix direction, street name, street type and suffix. Several obtained from the Florida Department of Law Enforce- datasets had fields for City or 5-digit ZIP code, but this was ment (FDLE) in December 2005. From the original data- not consistent – as a result, the use of a ‘‘zone” field (which base, only those offenders not in jail and with their latest commonly uses the ZIP or City field) was not feasible for known residence in the State of Florida were selected all reference data. (n = 18,551). Sex offenders were selected for this study as an example of mostly residential addresses, although it is known that some offenders reside in transient housing, 5.4. Geocoding process including hotels and motels. Each of the six databases contained fields for address, Address locators were created in ArcGIS 9 for the three city, County and 5-digit ZIP code. From each database reference datasets for each of the three Counties for a total the records associated with the three Counties of interest of nine address locators. Fields included in each locator were selected. 
included number, prefix direction, street name, street type Once the County-level databases were established (six and suffix. Additional fields were available in some cases types for three Counties for a total of 18 databases), the (usually a field for prefix type) but were not used to main- address fields were examined for any blanks, and these tain consistency between the address locators. No field was blanks were removed prior to geocoding. Blank addresses used for ‘‘zone” since this was not consistently available in were particularly common for the residential child care the reference datasets. City or ZIP fields are normally used facilities and fishing licenses. Removing these blanks may for ‘‘zone” and this is often required in geocoding since it introduce some bias. For example, residential child care speeds up database searches and prevents the occurrence facilities in the database may have a blank address while of a large number of ties. For example, an address like commercial child care facilities do not. However, the objec- 123 Main Street is expected to occur in almost every major tive in this study is to compare geocoding techniques, and city; the use of a City or ZIP field as a search criteron in not an assessment of the availability of child care. There- addition to the address itself prevents these ties or poten- fore, the removal of blanks was considered appropriate. tially incorrect matches. Since a separate address locator Table 1 reports the final sample size used for geocoding was built for each County, the use of a zone was not nec- and the number of blanks removed prior to geocoding essary. In the geocoding results, any ties were investigated where applicable. and none of these ties were a result of not using a ‘‘zone” field in the address locator. 5.3. Reference data for geocoding For each address locator, settings for spelling sensitivity and match score were set to identical thresholds. 
After Address point data, parcel data and street centerlines experimentation with a sample dataset, the minimum data were obtained from Bay, Collier and Seminole County match score was set to a value of 60 (out of 100). If the in April 2006. Currency of these data varied, but all had house number was not a one-to-one perfect match (for been updated in mid-2005 or later, and the data obtained address points and parcels) or did not fall within the house presented the most up-to-date and complete datasets avail- number range for a street segment (for streets), the maxi- able directly from the Counties’ GIS Department and/or mum score obtained by the ArcGIS 9 rule-based geocoding Property Appraiser. Each of these reference datasets con- algorithm was 52. As a result, using the minimum match tained several attributes for the address; although specific score of 60 in effect ensured that a match was only obtained fields varied, all had the following as a minimum: number, if the house number was an unambiguous perfect match. In 222 P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 addition, ties were permitted, but identified separately in and that the relationship between address points and the results. households is not consistent. For example, many multi-sto- rey apartment complexes with multiple units may get 5.5. Analysis assigned a single address points for every single building structure which may contain many units. For each database geocoded, the number of perfect Table 2 also reveals that there are many parcels without matches and ties (score = 100), the number of additional an address point. Most undeveloped parcels do not get matches and ties (score < 100), and the number of assigned an address point, or even an address for that mat- unmatched cases were determined. The overall match rate ter. 
The majority of parcels have only a single address for each database was determined by calculating the sum point – this would be typical of single family residential of all matches and ties as precentages of all address records. housing, but also applies to many other types of commer- The percentage of ties as a percentage of all matches was cial, industrial and institutional properties. A smaller num- also determined. For the child care facilities and properties ber of parcels have two address points and this would be with elevators the analysis was carried out for the entire typical of residential duplex units. An even smaller number database as well as separately for residential and commer- of parcels has more than two address points, and these cial addresses. would be typical of larger multi-family complexes and commercial sites with many individual businesses located 6. Results and discussion on the same parcel. The number of street segments within each County is 6.1. Description of reference data much lower than the number of parcels or address points. The number of parcels and address points per street seg- A summary of the number of features in each of the ref- ment varies from 4.78 to 12.30. While there is no signifi- erence datasets is provided in Table 2. The number of res- cance to the particular values for this ratio, it provides a idents per parcel for the three Counties ranges from 1.46 to general measure of the difference in resolution of the three 2.59. This number is strongly influenced by the presence of types of reference data. multi-family units with a large number of residents residing A closer examination of the relationship between on a single parcel. The highest number of 2.59 for Collier address points, parcels and street networks also reveals County is therefore not surprising, since multi-family units some interesting examples. 
For example, Tyndall Air Force are much more common here than in the other two Coun- Base in Bay County is classified as a single parcel owned by ties. The number of address points for Bay and Seminole the United States Air Force. However, a detailed street net- County is quite a bit smaller than the number or parcels, work within the base is part of the street centerlines data- while for Collier County the number is slightly higher. This base, and the base contains no fewer than 504 address reflects in part the same difference in multi-family housing: points for a residential complex located on the base. For a single parcel with multi-family units may contain many this particular example geocoding using only the parcels address points. would not generate any matches, but both the address When comparing the number of address points to the points and street network will likely produce matches. number of residents in each County, the ratios are much While an Air Force Base presents a very special case, large more similar with values ranging from 2.43 to 2.59. These parcels with many individual addresses are fairly common. values are similar to the average household size reported Collier County contains no less than 42 parcels with more in the 2000 Census, but this comparison is confounded than 100 address points. Most of these are mobile home by the fact that many address points are not residential, parks, RV parks, apartment complexes or other types of rental housing where many separate structures are located on the same parcel. Table 2 Descriptive summary of reference data used in geocoding 6.2. Placement of address points Bay Collier Seminole Population (2005 Census) 161,558 307,242 401,619 The placement of address points further illustrates some Parcels 110,651 173,787 154,919 Residents per parcel 1.46 1.77 2.59 of the differences between the address data models. 
One Address points 75,928 125,329 155,208 common approach to the placement of the address point Residents per address point 2.43 2.45 2.59 is at the centroid of the main structure on the parcel. Average household size (2000 Census) 2.48 2.39 2.59 Fig. 2 shows a typical example for a single family residen- Parcels w/o address point 37,352 66,427 18,229 tial neighborhood in Colliler County. There is only one Parcels with 1 address point 71,874 104,927 132,088 Parcels with 2 address points 1119 1047 3412 structure per parcel and a single address point is placed Parcels with >2 address points 306 1386 1190 within each parcel at (approximately) the building cen- Number of street line segments 15,892 14,125 21,580 troid. This results in one address point per residential unit. Parcels per street segment 6.96 12.30 7.18 The situation for multi-family residential areas is differ- Address points per street segment 4.78 8.87 7.19 ent as illustrated in Fig. 3 for Collier County. For duplexes P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 223 Fig. 2. Example of address points and parcel boundaries for single-family residential area in Collier County, FL. and townhouses (bottom right of Fig. 3), one address point tain only a single structure with a single unit, and the is placed for each residential unit, resulting in two or more address point is (mostly) placed at (approximately) the address points per structure. These residential units have building centroid. However, several of the structures in unique street numbers. For multi-unit apartment com- the shopping plaza contain multiple businesses, each with plexes (top center of Fig. 3), only a single address point their own street number, and an address point is placed is placed for each structure which may contain many resi- for each of these businesses. The placement of these dential units. 
These residential units share a single street address points is somewhat arbitrary, but appears to corre- number, and units are uniquely identified by their unit spond to the (approximate) centroid of the portion of the number (e.g. #101, 102, etc.). The examples in Figs. 2 structure occupied by the business. and 3 represent the most widely used approaches to the The examples in Figs. 2–4 illustrate the logic most use of address points for residential units within the data- widely followed: a unique address point is placed for every sets examined. unique street number, which may represent many units if For commercial, industrial and institutional properties, they share the same street number. The location of the the situations is different again. Fig. 4 shows a typical com- address points varies somewhat and can be near the cen- mercial area in Collier County. A number of parcels con- troid of the main structure or near the front of the 224 P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 Fig. 3. Example of address points and parcel boundaries for multi-family residential area in Collier County, FL. structure. For example, Fig. 5 shows a mixed residential/ A number of relevant trends can be derived from Table institutional area in Seminole County, and the placement 3 and Fig. 6. The match rates in general are highest for of the address points is clearly not near the building cen- street geocoding, followed by address points and parcels. troid, but in front of the building towards the street that Match rates for street and address point geocoding are gen- the address is located on. erally relatively close, with match rates for parcels being a distant 3rd and rarely exceeding 70%. This general trend 6.3. 
Geocoding match rates confirms the hypothesis that parcel geocoding results in lower match rates; parcel databases only associate one Table 3 reports the results for the match rates for the six address with a parcel, while in reality a single parcel may address databases for each of the three Counties. The num- contain many addresses. The slightly higher match rates ber of perfect matches and ties (score = 100), additional for street geocoding compared to address point geocoding matches (sore < 100) and unmatched cases are reported can in part be attributed to the way in which street num- separately. To facilitate the interpretation of the results, bers are stored in the two address data models. A single the overall match rates are also plotted in Fig. 6. address point contains a single house number, while a P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 225 Fig. 4. Example of address points and parcel boundaries for commercial area in Collier County, FL. street segment contains a range of house numbers. Address three Counties considered, but the pattern for the other point geocoding is therefore much more sensitive to data databases is not very consistent. The lack of agreement in entry errors since a match is only obtained when a perfect the match rates between the three Counties points to differ- one-to-one match can be made for the house numbers. The ences in the reference data. For example, for Collier addresses that were matched using street geocoding but did County street geocoding consistently produces match rates not produce a match using address point geocoding could of greater than 80%, while there is much more variability in therefore in fact represent false positives, since street geo- the street geocoding match rates for the other two Coun- coding does not provide a mechanism to determine if a par- ties. For parcel geocoding on the other hand, results for ticular street number exists. 
Without extensive field Collier County include some dramatically low match rates, validation however, it is not possible to say with certainty in particular for commercial properties (banks, elevators that the higher match rate obtained using street geocoding and grocery stores) which are all below 50%. Another dif- produced a better result. ference can be observed between address point and parcel Geocoding match rates vary strongly between the six geocding. For both Bay and Collier County, match rates address databases considered. Match rates for the for address point geocoding are much higher than for par- addresses of sex offenders are consistently highest for all cel geocoding, while for Seminole County the difference is 226 P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 Fig. 5. Example of address points, parcel boundaries and street centlines mixed residential and institutional area in Seminole County, FL. very small; in fact, for banks the match rates are the same same geocoding method for large sample sizes. Whitsel and for child care facilities the match rate for parcel geo- et al. (2006) report match rates of 30%, 77%, 78% and coding is higher. 79% using four different commercial vendors. Zhan et al. The variability in match rates also points to the influ- (2006) report match rates of 79% and 89% using two differ- ence of the data input quality. Differences in match rates ent ArcGIS-based methods. Match rates are known to be between the six different databases are similar in magni- lower in rural areas, which explains some of the variation tude to the difference in match rates between the three between different studies. Cayo and Talbot (2003) report methods of geocoding for the same database. 
This strongly match rates of 62% for rural addresses, 87% for sub-urban suggests that the quality of the input data and the quality areas and 94% for urban areas using street geocoding on a of the geocoding process are both important contributors large sample. Fewer studies have employed parcel geocod- to the quality of the final output, in this case the geocoding ing, but these few confirm the typically much lower match match rate. rate compared to street geocoding. Dearwent et al. (2001) The match rates reported in this study for street geocod- report a match rate of 70% for parcel geocoding versus ing are similar to those reported by other studies using the 89% for street geocoding of the same large sample. No Table 3 Summary of geocoding match results for six address databases for three Florida Counties using address point, parcel and street geocoding Category Commercial banks Child care facilities Properties with elevators Fishing licenses Grocery stores Sex offenders P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 Address Parcel Street Address Parcel Street Address Parcel Street Address Parcel Street Address Parcel Street Address Parcel Street n = 57 n = 77 n = 567 n = 10,336 n = 452 n = 289 Bay County Matched (score = 100) 31 17 34 57 37 58 307 117 332 5995 5195 6200 270 164 319 229 180 253 Tied (score = 100) 0 4 0 0 5 0 0 175 1 2 317 74 0 26 5 0 17 2 Matched (score < 100) 15 14 15 7 7 7 45 11 66 1547 1424 1628 66 22 77 20 9 21 Tied (score < 100) 0 2 1 0 1 1 0 22 3 23 246 86 0 6 3 1 6 1 Unmatched (score < 60) 11 20 7 10 24 8 215 242 165 2769 3154 2348 116 234 48 39 77 12 Match rate (%) 80.70 64.91 87.72 86.49 67.57 89.19 62.08 57.32 70.90 73.21 69.49 77.28 74.34 48.23 89.38 86.51 73.36 95.85 n = 127 n = 152 n = 2620 n = 11,116 n = 691 n = 189 Collier County Matched (score = 100) 70 47 81 91 77 101 1404 372 1435 6593 5216 6628 411 219 486 171 132 166 Tied (score = 100) 3 5 6 3 8 3 60 69 144 89 143 212 12 24 27 3 6 10 Matched 
(score < 100) 8 7 13 17 12 25 449 153 442 1728 1182 1702 28 23 74 7 7 3 Tied (score < 100) 2 1 3 3 0 5 16 37 72 88 139 501 2 5 10 0 0 5 Unmatched (score < 60) 44 67 24 38 55 32 691 1989 527 2618 4436 2073 238 420 94 8 44 5 Match rate (%) 65.35 47.24 81.10 75.00 63.82 80.72 73.63 24.08 79.89 76.45 60.09 81.35 65.56 39.22 86.40 95.77 76.72 97.35 n = 119 n = 206 n = 822 n = 9815 n = 891 n = 306 Seminole County Matched (score = 100) 44 44 45 130 128 131 356 283 382 5299 5045 5606 356 264 431 238 208 241 Tied (score = 100) 1 2 2 2 4 7 28 44 22 141 47 159 22 29 25 6 0 11 Matched (score < 100) 18 17 24 25 28 32 137 127 209 1624 1591 1717 46 102 147 18 27 25 Tied (score < 100) 0 0 15 2 1 7 19 11 24 92 154 402 4 8 67 2 0 2 Unmatched (score < 60) 56 56 33 47 45 29 289 364 192 2659 2978 1931 463 488 221 42 71 27 Match rate (%) 52.94 52.94 72.27 77.18 78.16 85.92 65.14 56.09 76.84 72.91 69.66 80.33 48.04 45.23 75.20 86.27 76.80 91.18 227 228 P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 100 90 80 Geocoding Match Rate (%) 70 Bay County 60 Address 50 Parcel Street 40 30 20 10 0 Commercial Daycares Elevators Fishing Grocery Sex Banks Licenses Stores Offenders 100 90 80 Geocoding Match Rate (%) 70 Collier County 60 Address 50 Parcel Street 40 30 20 10 0 Commercial Daycares Elevators Fishing Grocery Sex Banks Licenses Stores Offenders 100 90 80 Geocoding Match Rate (%) 70 Seminole County 60 Address 50 Parcel Street 40 30 20 10 0 Commercial Daycares Elevators Fishing Grocery Sex Banks Licenses Stores Offenders Fig. 6. Geocoding match rates by County. P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 229 published studies were identified that have employed erties with elevators, the match rates for residential address address point geocoding in the United States. are higher. In this case, however, the difference is much lar- ger, with relative high match scores (>80%) for all three 6.4. 
Commercial versus residential types of geocoding for all three Counties. The difference in match rates between the three types of geocoding is in Some of the most difficult addresses to correctly geocode fact quite small, suggesting that for single family residential are commercial and multi-unit residential addresses. Table address the choice of address data model does not influence 4 shows a comparison of commercial properties with eleva- match rates very strongly, in sharp contrast to multi-unit tors and multi-unit residential properties with elevators. residential addresses. Results indicate that match rates for commercial properties are consistently lower than for residential properties for 6.5. Ties both address point and street geocoding. For residential properties the match rates for address points are only Ties represent a concern in geocoding since they nor- slightly lower than for street geocoding, suggesting that mally require manual inspection to determine which of the address point reference data contains a fairly complete the ties represents the correct match. Even with manual and reliable representation of multi-unit residential inspection, no determination may be possible due to ambi- addresses. For commercial properties, the results for guities in either the reference data, the address input data, address point geocoding are not as good. Parcel geocoding or both. A low number of ties, therefore, is an indication of results in much lower match scores for both types of prop- a more reliable result. erties, with dramatically low match scores for residential Table 6 reports the number of ties as a percentage of all properties in Collier (13%) and Seminole County (12%). matches (score > 60) for all the address databases and each This confirms the poor performance of parcel geocoding of the three geocoding techniques. Address point geocod- for multi-unit residential addresses. 
ing consistently produces the lowest number of ties, while A second comparison between commercial and residen- the results for the other two techniques is more variable. tial addresses is provided by the results for child care facil- The percentage ties for street geocoding is generally low ities in Table 5. In this case the residential addresses are for Bay County (<1%) but much higher for Collier (6– mostly single family homes. Similar to the results for prop- 10%) and Seminole County (5–20%). Variability in the Table 4 Comparison of geocoding match rates for commercial and residential properties with elevators Category Commercial elevators Residential elevators Address Parcel Street Address Parcel Street n = 316 n = 251 Bay County Matched (score = 100) 144 70 155 163 47 177 Tied (score = 100) 0 53 1 0 122 0 Matched (score < 100) 35 11 55 10 0 11 Tied (score < 100) 0 18 1 0 4 2 Unmatched (score < 60) 137 164 104 78 78 61 Match rate (%) 56.65 48.10 67.09 68.92 68.92 75.70 n = 1181 n = 1439 Collier County Matched (score = 100) 562 266 674 842 106 761 Tied (score = 100) 39 42 53 21 27 91 Matched (score < 100) 119 109 128 330 44 314 Tied (score < 100) 4 27 27 12 10 45 Unmatched (score < 60) 457 737 299 234 1252 228 Match rate (%) 61.30 37.60 74.68 83.74 13.00 84.16 n = 787 n = 42 Seminole County Matched (score = 100) 335 281 356 21 2 26 Tied (score = 100) 18 42 18 10 2 4 Matched (score < 100) 136 127 204 1 0 5 Tied (score < 100) 16 10 23 3 1 1 Unmatched (score < 60) 282 327 186 7 37 6 Match rate (%) 64.17 58.45 76.37 83.33 11.90 85.71 230 P.A. 
percentage of ties for parcel geocoding is even greater, and generally higher than for street geocoding for both Bay and Collier County. Several very high values, including a value of 61% for elevators in Bay County, highlight the lack of reliability obtained using parcel geocoding. Ties are also higher in general for the commercial addresses, for all three geocoding techniques.

Table 5
Comparison of geocoding match rates for commercial and residential child care facilities

| County | Category | Method | Matched (score = 100) | Tied (score = 100) | Matched (score < 100) | Tied (score < 100) | Unmatched (score < 60) | Match rate (%) |
|---|---|---|---|---|---|---|---|---|
| Bay County | Commercial child care (n = 54) | Address | 38 | 0 | 4 | 0 | 12 | 77.78 |
| | | Parcel | 22 | 4 | 3 | 0 | 25 | 53.70 |
| | | Street | 39 | 0 | 4 | 1 | 10 | 81.48 |
| | Residential child care (n = 23) | Address | 19 | 0 | 3 | 0 | 1 | 95.65 |
| | | Parcel | 15 | 1 | 4 | 1 | 2 | 91.30 |
| | | Street | 19 | 0 | 3 | 0 | 1 | 95.65 |
| Collier County | Commercial child care (n = 104) | Address | 56 | 3 | 13 | 3 | 29 | 72.12 |
| | | Parcel | 44 | 7 | 8 | 0 | 45 | 56.73 |
| | | Street | 64 | 3 | 8 | 5 | 24 | 76.92 |
| | Residential child care (n = 48) | Address | 35 | 0 | 4 | 0 | 9 | 81.25 |
| | | Parcel | 33 | 1 | 4 | 0 | 10 | 79.17 |
| | | Street | 37 | 0 | 17 | 0 | 8 | 87.10 |
| Seminole County | Commercial child care (n = 124) | Address | 66 | 2 | 18 | 2 | 36 | 70.97 |
| | | Parcel | 61 | 4 | 21 | 1 | 37 | 70.16 |
| | | Street | 64 | 4 | 24 | 7 | 25 | 79.84 |
| | Residential child care (n = 82) | Address | 64 | 0 | 7 | 0 | 11 | 86.59 |
| | | Parcel | 67 | 0 | 7 | 0 | 8 | 90.24 |
| | | Street | 67 | 3 | 8 | 0 | 4 | 95.12 |

Table 6
Ties as a percentage of all matches by geocoding method

| County | Geocoding method | Commercial banks | Child care | Elevators | Fishing licenses | Grocery stores | Sex offenders |
|---|---|---|---|---|---|---|---|
| Bay County | Address | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.40 |
| | Parcel | 16.22 | 12.00 | 60.62 | 7.84 | 14.68 | 10.85 |
| | Street | 2.00 | 1.52 | 1.00 | 2.00 | 1.98 | 1.08 |
| Collier | Address | 6.02 | 5.26 | 3.94 | 2.08 | 3.09 | 1.66 |
| | Parcel | 10.00 | 8.25 | 16.80 | 4.22 | 10.70 | 4.14 |
| | Street | 8.74 | 5.97 | 10.32 | 7.88 | 6.20 | 8.15 |
| Seminole | Address | 1.59 | 2.52 | 8.70 | 3.26 | 6.07 | 3.03 |
| | Parcel | 3.17 | 3.11 | 11.83 | 2.94 | 9.18 | 0.00 |
| | Street | 19.77 | 7.91 | 7.22 | 7.12 | 13.73 | 4.66 |

7. Conclusions

This study has provided an empirical comparison of address point, parcel and street geocoding. In general, match rates for address point geocoding are only slightly lower than for street geocoding. The higher rate for street geocoding could in part be due to false positives, but confirming this requires extensive field validation. Match rates using parcel geocoding are much lower, but this varies by database type and geographic area.

Substantial differences were observed between commercial and residential addresses and between different types of residential addresses. In general, higher match rates are obtained for residential addresses relative to commercial addresses. For single family residential addresses, match rates are relatively high for all three geocoding techniques considered. For multi-unit residential addresses, however, parcel geocoding is very unreliable, while results for both address points and street geocoding are much better.

Geocoding match rates were found to vary substantially by type of address database and by geographic area, suggesting that determining an “acceptable” or “good” match rate requires very context-specific considerations. Variability in match rates between address models is only one of several considerations. The lack of consistency in match rates between geographic areas using the same type of address database and the same address model also suggests that geocoding quality is very much a function of the quality and consistency of local reference data. Substantial differences in match rates between the six different databases also suggest that the quality of the input data is a very critical contributor to the final geocoding match rate.

One of the limitations of this study is that the comparison of geocoding methods is limited to three Counties in the United States. Specific results in other jurisdictions may be different, but the general nature of the differences between the three address models is likely to be similar. It should also be noted that the chosen study areas reflect current best practice in terms of digital spatial data, and the availability of digital parcel boundary and address point data for use in a GIS environment is not yet widespread in the United States.

Address points appear very promising as an address data model for geocoding. They represent excellent positional accuracy, produce match rates only slightly lower than those for street geocoding, and result in a low number of ties. In addition, they provide an extra validation of the address input data, since it is less likely a false positive will be introduced through a non-existing street number as may be the case for street geocoding. While it may only be a matter of time before address point data is available for most of the United States, standardization efforts would provide a logical framework for the development of a national address point database and an opportunity to learn from the efforts in other jurisdictions.

Future research efforts in this area should focus on refinements of the address point data model, such as the occurrence of multiple units with the same street number (currently represented as one address point), vertical representation of units, and consistency in the placement of address points. Possible refinements of the parcel data model consist of capturing multiple addresses within a single parcel, as well as residences with street addresses that are different from the legal street address of the parcel itself.
Finally, improved quality control during the original Of the three geocoding methods considered, street geo- capture of input data is paramount to improving geocoding coding is the most widely employed. Online geocoding ser- match rates. Continued improvements in the address data vices (Google Maps, Yahoo Maps, MapQuest) rely almost models and reference data will be in vain unless address exclusively on street geocoding, as do most commercial standardization and validation procedures during input geocoding firms. Digital street reference data is available are also improved. for nearly all areas within the United States and for many other jurisdictions. Street geocoding has also become very References affordable, in many cases even free. Many commercial GIS packages have built-in tools and reference data for Arctur, D., & Zeiler, M. (2004). Designing geodatabases: Case studies in street geocoding. Parcel geocoding is becoming more GIS data modeling. Redlands, CA: ESRI Press. Bonner, M. R., Han, D., Nie, J., Rogerson, P., Vena, J. E., & widely used, but typically in studies that are limited in geo- Freudenheim, J. L. (2003). Positional accuracy of geocoded addresses graphic scope. Digital parcel data is not available at the in epidemiologic research. Epidemiology, 14(4), 408–412. national or even State level within the United States, and Bow, C. J. D., Waters, N. W., Faris, P. D., Seidel, J. E., Galbraith, P. D., has to be obtained directly from local government agencies. Knudtson, M. L., et al. (2004). Accuracy of city postal code coordinates as a proxy for location of residence. International Journal The most recent estimates suggest that only about 60% of of Health Geographics, 3(5). all approximately 140 million parcels in the United States Burra, T., Jerrett, M., Burnett, R. T., & Anderson, M. (2002). 
Conceptual is available in a format that can be utilized in a GIS envi- and practical issues in the detection of local disease clusters: A study of ronment (Stage & Von Meyer, 2003). Even where available, mortality in Hamilton, Ontario. The Canadian Geographer, 46, 160–171. utilizing parcel data requires considerable more skill and Cayo, M. R., & Talbot, T. O. (2003). Positional error in automated effort than street geocoding, in part because parcel data is geocoding of residential addresses. International Journal of Health Geographics, 2(10). not specifically designed with geocoding in mind. Address Christen, P., Churches, T., & Zhu, J. X. (2002). Probabilistic name and point data is not widely used in the United States, mostly address cleaning and standardization. The Australasian Data Mining because data availability is limited. Commercial firms Conference, Canberra, Australia, December 3, 2002. report that approximately 40 million address points are Christen, P., Churches, T., & Willmore, A. (2004). A probabilistic available for the United States, covering selected metropol- geocoding system based on a national address file. The Australasian Data Mining Conference, Cairns, Australia, December 6, 2004. itan regions (ESRI, 2007; TeleAtlas, 2006). Where avail- Christian, P. (1998). Soundex – Can it be improved. Computer in able, address points are relatively easy to use for Geneaology, 6(5). geocoding in a GIS environment because geocoding is Churches, T., Christen, P., Lim, K., & Zhu, J. X. (2002). Preparation of one of the principal objectives in the collection of address name and address data for record linkage using hidden Markov models. BMC Medical Informatics and Decision Making, 2(9). points by local governments. Address point data is avail- Dearwent, S. M., Jacobs, R. J., & Halbert, J. B. (2001). Locational able as a national dataset in both the United Kingdom uncertainty in georeferencing public health datasets. 
Journal of and Australia, but to date no comparative analysis or qual- Exposure Analysis and Environmental Epidemiology, 11, 329–334. ity assessment has been performed between address points ESRI (2007). ArcGIS Business Analyst 9.2 now shipping. ArcNews, from multiple jurisdictions. Spring 2007. 232 P.A. Zandbergen / Computers, Environment and Urban Systems 32 (2008) 214–232 Goldberg, D. W., Wilson, J. P., & Knoclock, C. A. (2007). From text to Statistics Canada. (2002). Statistics Canada postal code conversion file geographic coordinates: The current state of geocoding. URISA reference guide. Statistics Canada, Ministry of Industry, Ottawa, ON Journal, 19(1), 33–46. 92F0153GIE. Gu, L., Baxter, R., Vickers, D., & Rainsford, C. (2003). Record linkage: Strickland, M. J., Siffel, C., Gardner, B. R., Berzen, A. K., & Correa, A. Current practice and future directions. Technical report 03/83, CSIRO (2007). Quantifying geocode location error using GIS methods. Mathematical and Information Sciences, Australia. Environmental Health, 6(10). Jacquez, G. M., & Waller, L. (2000). The effect of uncertain locations on TeleAtlas (2006). TeleAtlas Address Point product description. TeleAtlas, disease cluster statistics. In H. T. Mowrer & R. G. Congalton (Eds.), Lebanon, OH. Quantifying spatial uncertainty in natural resources: Theory and Waller, L. A. (1996). Statistical power and design of focused clustering applications for GIS and remote sensing (pp. 53–64). Chelsea, Michigan: studies. Statistics in Medicine, 15, 765–782. Arbor Press. Ward, M. H., Nuckols, J. R., Giglierano, J., Bonner, M. R., Wolter, C., Jaro, M. (1984). Record linkage research and the calibration of record Airola, M., et al. (2005). Positional accuracy of two methods of linkage algorithms. Bureau of the Census, Statistical Research Division geocoding. Epidemiology, 16(4), 542–547. Report Series, SRD Report No. Census/SRD/RR-84/27. Whitsel, E. A., Rose, K. M., Wood, J. L., Henley, A. C., Liao, D., et al. 
Karimi, H. A., & Durcik, M. (2004). Evaluation of uncertainties (2004). Accuracy and repeatability of commercial geocoding. American associated with geocoding techniques. Computer-Aided Civil and Journal of Epidemiology, 160(10), 1023–1029. Infrastructure Engineering, 19, 170–185. Whitsel, E. A., Quibrera, P. M., Smith, R. L., Catellier, D. J., Liao, D., Patman, F., & Shaefer, L. (2001). Is Soundex good enough for you? On the Henley, A. C., & Heiss, G. (2006). Accuracy of commercial geocoding: hidden risks of Soundex-based name searching. Language Analysis Assessment and implications. Epidemiological Perspectives and Inno- Systems Inc. vations, 3(8). Ratcliffe, J. H. (2004). Geocoding crime and a first estimate of a minimum Winkler, W. E. (1999). The state of record linkage and current research acceptable hit rate. International Journal of Geographical Information problems. RR99/03, United States Bureau of the Census. Science, 18(1), 61–72. Zandbergen, P. A., & Green, J. W. (2007). Error and bias in determining Ratcliffe, J. H. (2001). On the accuracy of TIGER-type geocoded address exposure potential of children at school locations using proximity-based data in relation to cadastral and census areal units. International GIS techniques. Environmental Health Perspectives, 115(9), 1363–1370. Journal of Geographical Information Science, 15(5), 473–485. Zandbergen, P. A. (2007). Influence of geocoding quality on environmen- Rushton, G., Armstrong, M. P., Gittler, J., Greene, B., Pavlik, C. E., tal exposure assessment of children living near high traffic roads. BMC West, M. W., et al. (2006). Geocoding in cancer research: A review. Public Health, 7(37). American Journal of Preventative Medicine, 30(2S), S16–S24. Zhan, F. B., Brender, J. D., De Lima, I., Suarez, L., & Langlois, P. H. Schootman, M., Sterling, D. A., Struthers, J., Yan, Y., Laboube, T., Emo, (2006). Match rate and positional accuracy of two geocoding methods B., & Higgs, G. (2007). 
Positional accuracy and geographic bias of four for epidemiologic research. Annals of Epidemiology, 16(11), 842–849. methods of geocoding in epidemiologic research. Annals of Epidemi- Zimmerman, D. L., Fang, X., Mazumdar, S., & Rushton, G. (2007). ology, 17(6), 464–470. Modeling the probability distribution of positional errors incurred by Stage, S., & Von Meyer, N. (2003). An assessment of parcel data in the residential address geocoding. International Journal of Health Geo- United States. Federal Geographic Data Committee’s Subcommittee graphics, 6(1). on Cadastral Data. Zimmerman, D.L. (2007). Estimating the intensity of a spatial point Stanier, A. (1990). How accurate is Soundex matching? Computers in process from locations coarsened by incomplete geocoding. Biometrics, Geneaology, 3(7), 286–288. in press.
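The match rates in Tables 4 and 5 and the tie percentages in Table 6 can be recomputed from the category counts. The following is a minimal sketch in Python, not the author's code: the class and function names are hypothetical. It assumes that tied candidates count as matches and that Table 6 pools the commercial and residential records of each database. Both assumptions are consistent with the published figures for Bay County elevators, which are the only values checked here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GeocodeCounts:
    """Counts per category, in the row order used by Tables 4 and 5."""
    matched_100: int  # Matched (score = 100)
    tied_100: int     # Tied (score = 100)
    matched_lt: int   # Matched (score < 100)
    tied_lt: int      # Tied (score < 100)
    unmatched: int    # Unmatched (score < 60)

    @property
    def ties(self) -> int:
        return self.tied_100 + self.tied_lt

    @property
    def matches(self) -> int:
        # Assumption: tied candidates are counted as matches.
        return self.matched_100 + self.matched_lt + self.ties

    @property
    def total(self) -> int:
        return self.matches + self.unmatched


def match_rate(counts: GeocodeCounts) -> float:
    """Matches as a percentage of all input records."""
    return 100.0 * counts.matches / counts.total


def tie_percentage(groups: Iterable[GeocodeCounts]) -> float:
    """Ties as a percentage of all matches, pooled over several groups."""
    groups = list(groups)
    matches = sum(g.matches for g in groups)
    ties = sum(g.ties for g in groups)
    return 100.0 * ties / matches if matches else 0.0


# Bay County elevator counts copied from Table 4.
commercial_address = GeocodeCounts(144, 0, 35, 0, 137)
commercial_parcel = GeocodeCounts(70, 53, 11, 18, 164)
residential_parcel = GeocodeCounts(47, 122, 0, 4, 78)

print(round(match_rate(commercial_address), 2))  # 56.65, as in Table 4
print(round(tie_percentage([commercial_parcel, residential_parcel]), 2))  # 60.62, as in Table 6
```

Pooling ties and matches before dividing, rather than averaging the two group percentages, is what reproduces the 60.62% reported for parcel geocoding of elevators in Bay County.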