
Random Sampling on the Internet

From:
Date: 18 Jan 2000
Time: 01:31:46

Comments

Stan,

I have managed to find some information about sampling on the Internet. Below is an extract from the site.

http://www.cc.gatech.edu/gvu/user_surveys/survey-1998-10/

mailto:www-survey@cc.gatech.edu

GVU's WWW Surveying Team
GVU Center, College of Computing
Georgia Institute of Technology
Atlanta, GA 30332-0280

Thanks for your help - regards Lloyd

Survey Methodology

The Internet presents a unique problem for surveying. At the heart of the issue is the methodology used to collect responses from individual users. Since there is no central registry of all Internet users, completing a census, where an attempt is made to contact every user of the Internet, is neither practical nor feasible financially. As such, Internet surveys attempt to answer questions about all users by selecting a subset of users to participate in the survey. This process of determining a set of users is called sampling, since only a sample of all possible users is selected.

Sampling

There are two types of sampling: random and non-probabilistic. Random sampling creates a sample using a random process for selection of elements from the entire population. Thus, each element has an equal chance of being chosen to become part of the sample. To illustrate, suppose that the universe of entities consists of a hat that contains five slips of paper. A method to select elements from the hat using a random process would be to 1) shake the contents of the hat, 2) reach into the hat, and 3) pick a slip of paper with one's eyes closed. This process would ensure that each slip of paper had an equal chance of being selected. As a result, one could not claim that some slips of paper were favored over the others, causing a bias in the sample.
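The hat example can be sketched in a few lines of Python. This is only an illustration: the slip labels, the seed, and the number of repetitions are assumptions and are not part of the original survey description. Repeating the draw many times shows that each slip comes up about one time in five.

```python
import random
from collections import Counter

def draw_slip(hat, rng):
    """Pick one slip so that every slip in the hat is equally likely."""
    return rng.choice(hat)

# Hypothetical labels for the five slips of paper.
hat = ["A", "B", "C", "D", "E"]

# Fixed seed (an assumption) so that the run is repeatable.
rng = random.Random(0)

# Repeat the draw many times, putting the slip back each time.
draws = 50_000
counts = Counter(draw_slip(hat, rng) for _ in range(draws))

# Share of draws in which each slip was picked; each should be near 0.2.
shares = {slip: counts[slip] / draws for slip in hat}
print(shares)
```

No slip is favored, so each share settles close to 0.2 as the number of draws grows.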

Given that the sample was selected using a random process, and each element had an equal chance of being selected for the sample, results obtained from measuring the sample can generalize to the entire population. This statistical affordance is why random sampling is widely used in surveys. After all, the whole purpose of a survey is to collect data on a group and have confidence that the results are representative of the entire population. Random digit dialing, also called RDD, is a form of random sampling where phone numbers are selected randomly and interviews of people are conducted over the phone.
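As a rough sketch of the idea behind random digit dialing, the Python below builds phone numbers from a known prefix plus four random digits. The prefixes, the seed, and the function name random_digit_dial are made up for illustration, and real RDD designs are more elaborate than this.

```python
import random

def random_digit_dial(prefixes, n, rng):
    """Return n distinct numbers: a random prefix plus four random digits."""
    if n > len(prefixes) * 10_000:
        raise ValueError("not enough possible numbers for the requested sample")
    numbers = set()
    while len(numbers) < n:
        prefix = rng.choice(prefixes)   # every prefix equally likely
        suffix = rng.randrange(10_000)  # every four-digit ending equally likely
        numbers.add(f"{prefix}-{suffix:04d}")
    return sorted(numbers)

# Hypothetical area code and exchange combinations.
prefixes = ["404-555", "770-555"]

# Fixed seed (an assumption) so that the run is repeatable.
rng = random.Random(1)

numbers = random_digit_dial(prefixes, 10, rng)
for number in numbers:
    print(number)
```

Because the last four digits are generated rather than looked up, listed and unlisted numbers under these prefixes have the same chance of being selected.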

Non-probabilistic sampling does not ensure the elements are selected in a random manner. It is difficult then to guarantee that certain portions of the population were not excluded from the sample since elements do not have an equal chance of being selected. To continue with the above example, suppose that the slips of paper are colored. A non-probabilistic methodology might select only certain colors for the sample. It becomes possible that the slips of paper that were not selected differ in some way from those that were selected. This would indicate a systematic bias in the sampling methodology. Note that it is entirely possible that the colored slips that were not selected did not differ from the selected slips, but this could only be determined by examining both sets of slips.
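The colored slips can be sketched the same way. In the Python below, the colors and the number written on each slip are invented for illustration. Selecting only the red slips gives an average that differs from the average of the whole hat, and the difference only becomes visible when the excluded slips are examined too.

```python
from statistics import mean

# Hypothetical population: each slip has a color and a number written on it.
slips = [("red", 10), ("red", 12), ("blue", 30), ("blue", 32), ("green", 20)]

def mean_value(chosen):
    """Average of the numbers written on the given slips."""
    return mean(value for _, value in chosen)

# Non-probabilistic selection: only red slips are allowed into the sample.
selected = [slip for slip in slips if slip[0] == "red"]
excluded = [slip for slip in slips if slip[0] != "red"]

population_mean = mean_value(slips)
selected_mean = mean_value(selected)
excluded_mean = mean_value(excluded)

print("all slips:     ", population_mean)
print("selected (red):", selected_mean)
print("excluded:      ", excluded_mean)
```

If the numbers on the slips had nothing to do with color, the three averages would be close, which matches the caveat above that bias is possible but not certain.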

Self-selection

Since there is no centralized registry of all users of the Internet and users are spread out all over the world, it becomes quite difficult to select users from the entire population at random. To simplify the problem most surveys of the Internet focus on a particular region of users, which is typically the United States, though surveys of European, Asian, and Oceanic users have also been conducted. Still, the question becomes how to contact users and get them to participate. The traditional methodology is to use RDD. While this ensures that the phone numbers and thus users are selected at random, it potentially suffers from other problems as well, namely self-selection.

Self-selection occurs when the entities in the sample are given a choice to participate. If a set of members in the sample decides not to participate, it reduces the ability of the results to generalize to the entire population. This decrease in the confidence of the survey occurs since the group that decided not to participate may differ in some manner from the group that participated. It is important to note that self-selection occurs in nearly all surveys of people. In the case of RDD, if a call is placed to a number in the sample and the user hangs up the phone, self-selection has occurred. Likewise, if in a mail-based survey, certain users do not respond, self-selection has occurred. While there are techniques like double sampling to deal with those members who chose not to participate or respond, most surveys do not employ these techniques due to their high cost.
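A small simulation can show how self-selection distorts an otherwise random sample. Everything in the Python below is an assumption made for illustration: the population of weekly hours online, the sample size, and the rule that heavier users are more likely to respond.

```python
import random
from statistics import mean

# Fixed seed (an assumption) so that the run is repeatable.
rng = random.Random(42)

# Hypothetical population: weekly hours online for 100,000 users.
population = [rng.uniform(0, 40) for _ in range(100_000)]

# Step 1: a random sample, where every user has an equal chance of selection.
sample = rng.sample(population, 5_000)

# Step 2: self-selection. Assumed rule: the chance that a sampled user
# responds grows with hours online (0 hours never, 40 hours always).
respondents = [hours for hours in sample if rng.random() < hours / 40]

population_mean = mean(population)
sample_mean = mean(sample)
respondent_mean = mean(respondents)

print("population: ", round(population_mean, 1))
print("full sample:", round(sample_mean, 1))
print("respondents:", round(respondent_mean, 1))
```

The full sample tracks the population closely, but the average among those who chose to respond is noticeably higher, because the users who declined differ from the ones who took part.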


Last changed: November 20, 2007