Certainly Rimm's infamous figure of 83.5% has been shot down from a methodological perspective, but how much porn is there on USENET? Not finding anyone else putting together statistics, I decided to do a little digging as background to a column on Cybercensure (more pointers on the topic can be found here). This page does not purport to be a serious study, only a personal attempt to determine whether the figure "83.5%" was unreasonable.
If anybody reading this knows of other easily available, similar stats, I wouldn't mind finding out about them. (psm@sics.se)
At the time of the sample, our local newsfeed received 8458 newsgroups, of which 3452 carried little or no traffic. I chose 100 postings at random from the 7308 files that were larger than 10000 bytes (including the full header) and were downloaded to the server on the 6th of August (a Sunday). I looked at each posting manually and classified it. Multi-part messages were assembled to determine the nature of the posting, and images were decoded to determine the nature of their content.
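The selection step described above can be sketched roughly as follows. This is not the script actually used for the sample: the function name, its parameters, and the idea of walking a spool directory are assumptions made for illustration, and the filter on the download date is left out for brevity.

```python
import os
import random

def sample_large_files(spool_dir, min_bytes=10000, k=100, seed=None):
    """Pick k files at random among those larger than min_bytes.

    spool_dir is a hypothetical directory holding one file per posting.
    """
    candidates = []
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            # The size on disk includes the full header, as in the text.
            if os.path.getsize(path) > min_bytes:
                candidates.append(path)
    # Sort so that a fixed seed gives a reproducible sample.
    candidates.sort()
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))
```

Calling `sample_large_files("/some/spool", k=100)` would return up to 100 paths; the path is a placeholder, not the location used at the site.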
Results in number of postings:
 5  Artwork (non-sexual)
 2  Images (non-sexual)
 3  Models (non-sexual)
 6  Sounds (one purports to be erotic)
 4  Sexually explicit text
 5  Parts of movie, probably hardcore
 4  Hardcore porn (2 from BBS)
 1  Don't know (probably softporn cartoon)
 1  Don't know (probably softporn)
 4  Sort of hardcore porn (1 from BBS)
22  Softporn (1 from BBS)
 1  Font
 1  Movie (probably non-sexual)
 3  Program
38  Others (not images)
If we take into account crosspostings and multi-part messages (text as well as images), and do a little grouping, we get:
 4.36  Images (and movie) of non-sexual nature
32.53  Other
 2.33  Programs
 0.81  Sounds (0.25 purports to be erotic)
 3.50  Sexually explicit stories
 0.82  Movies (probably hardcore)
 2.83  Hardcore porn
 2.70  Sort of hardcore porn
13.85  Softporn
 1.00  Don't know (probably softporn)
 0.33  Don't know (probably softporn cartoon)
65.06  (total)
(We adjust down the significance of finding a cross-posted and/or multi-part file. Thus, a "hit" of a file that is cross-posted 3 times and comes in 5 parts counts as 1/15th towards its category.)
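The down-weighting can be written out in a few lines. The function name and argument names are illustrative and do not come from the original scripts.

```python
from fractions import Fraction

def hit_weight(crosspost_count: int, part_count: int) -> Fraction:
    """Weight of one sampled file: 1 / (cross-postings * parts)."""
    if crosspost_count < 1 or part_count < 1:
        raise ValueError("counts must be at least 1")
    return Fraction(1, crosspost_count * part_count)

# The example from the text: cross-posted 3 times, 5 parts.
print(hit_weight(3, 5))  # 1/15
```

A posting that is neither cross-posted nor split gets weight 1, so the weighted totals can never exceed the raw count of 100.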
Now, as far as images are concerned, we get the following. Here we include movies and count "probably" as "yes", i.e. we're conservative (we prefer to overestimate the presence and harshness of porn), and figure out the relative percentages.
Percent  Postings  Category
 16.8      4.36    Non-sexual
 24.5      6.35    Sort of hardcore or harsher (3.65 hardcore, 2.70 sort of hardcore)
 58.6     15.18    Softporn
 99.9     25.89    (total)
Thus, 83% of the images are sexual, and roughly a third of those are slightly hardcore or harder (split roughly 50-50 between slightly hard and hardcore). But only 16% are from BBSs, and only 14% are hardcore.
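The percentages follow directly from the weighted counts in the tables above. The short sketch below redoes the arithmetic; the variable names are illustrative, and the grouping of rows follows the "count probably as yes" rule stated in the text.

```python
# Weighted counts copied from the grouped table above.
non_sexual = 4.36
hardcore = 0.82 + 2.83            # movies (probably hardcore) + hardcore porn
sort_of_hardcore = 2.70
softporn = 13.85 + 1.00 + 0.33    # softporn + both "don't know" rows

total_images = non_sexual + hardcore + sort_of_hardcore + softporn

def share(weight):
    """Percentage of all image postings, by weighted count."""
    return 100 * weight / total_images

print(round(total_images, 2))                        # weighted image postings
print(round(share(non_sexual), 1))                   # non-sexual
print(round(share(hardcore + sort_of_hardcore), 1))  # sort of hardcore or harsher
print(round(share(softporn), 1))                     # softporn
print(round(share(hardcore), 1))                     # hardcore only
```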
Conclusion: this sample cannot reject the statement that ~80% of images on the USENET have sexual content.
Another issue is how common postings of pornographic images are. "Weight" of traffic on Usenet is most reasonably measured in terms of number of postings.
I selected 100 posts at random from the 41161 postings that were dated 9th August, 1995 (a Wednesday).
After adjusting for cross-posting (there were no multi-part postings in the sample), the result was the equivalent of 75 posts. I classified them manually along two lines: image/text and sexual/non-sexual. The two coincided: all of the images were sexual or erotic, and none of the selected text postings were:
 1.7%   1.25  image (sexual or erotic)
98.3%  73.70  text (non-erotic)
In other words, measured in terms of poster activity, roughly 2% of Usenet postings are images. Using our values from sample I above, this would indicate that roughly 1.5% of Usenet postings are of a sexual nature, and that less than 0.24% are hardcore.
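The extrapolation combines the image share from the second sample with the sexual and hardcore shares from the first. A minimal sketch of that arithmetic, with illustrative variable names and the rounded 83% and 14% figures taken from the text:

```python
# Second sample: weighted postings by type.
image_weight = 1.25
text_weight = 73.70
image_share = image_weight / (image_weight + text_weight)

# First sample: share of images that are sexual / hardcore (rounded values).
sexual_share_of_images = 0.83
hardcore_share_of_images = 0.14

sexual_share_of_posts = image_share * sexual_share_of_images
hardcore_share_of_posts = image_share * hardcore_share_of_images

print(f"images:   {100 * image_share:.2f}% of postings")
print(f"sexual:   {100 * sexual_share_of_posts:.2f}% of postings")
print(f"hardcore: {100 * hardcore_share_of_posts:.2f}% of postings")
```

This treats the two samples as describing the same population of images, which is an assumption: the first sample was drawn by file size on a Sunday, the second by posting date on a Wednesday.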
The general statistic not only seems correct, but is understated - there are rather few images posted that are non-sexual/erotic, in all the newsfeeds.
However, the hardcore portion is small (14%) and the extreme hardcore (pedophilia etc) is exceedingly small. Searching manually, I do indeed find images of all sorts of interesting sexual deviations, but they are rare (clearly under 0.5%).
I do not find BBS sources to be as dominant as Rimm did; quite the opposite.
Rimm only selected the following newsgroups:
alt.binaries.pictures.erotica
alt.binaries.pictures.bestiality
alt.sex.fetish.watersports
alt.binaries.pictures.female
alt.binaries.pictures.tasteless
Rimm justifies this by stating that these were the "largest available at the research site". Three of these are rare, as far as I can tell (tasteless, watersports, and bestiality), so basing any analysis on them as if they were representative would be flawed. The skewed selection would partly explain the low number of images Rimm found (3254 over a multiple-week period). Roughly half of the images in my sample come from the newsgroup alt.binaries.pictures.erotica and its subgroups.
In effect we have selected 26 images from what we estimate to be 1892 images in the newsfeed that day. The reliability of the 83% figure is non-trivial to estimate, since the population size (1892) is itself an estimate based on the same sample. Also, the number of selected images is rather small. But since our goal is just to determine whether the ~80% figure is absurd or not, this should suffice (we conclude that it is not absurd).
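To give a feel for how loose the estimate is, the sketch below reproduces the population figure and then computes a Wilson score interval for the sexual share. The interval is not part of the original analysis: it treats the weighted count of about 26 images as if it were 26 independent draws, which is a simplifying assumption and ignores the uncertainty in the population size mentioned above.

```python
import math

LARGE_FILES = 7308            # files over 10000 bytes that day
SAMPLE_SIZE = 100             # postings drawn
IMAGE_WEIGHT = 25.89          # weighted image postings in the sample
SEXUAL_WEIGHT = 6.35 + 15.18  # sort of hardcore or harsher + softporn

# Scale the image share of the sample up to the whole day's feed.
estimated_images = LARGE_FILES * IMAGE_WEIGHT / SAMPLE_SIZE

def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

p_hat = SEXUAL_WEIGHT / IMAGE_WEIGHT
# ASSUMPTION: effective sample size of 26 images.
low, high = wilson_interval(p_hat, n=26)

print(round(estimated_images))
print(round(low, 2), round(high, 2))
```

Under these assumptions the interval is wide but contains 0.8, which is in line with the modest claim that the ~80% figure cannot be rejected.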
We assume that feeds on a Sunday are representative.
We assume that our site is representative.
We overestimate porn presence and harshness by "rounding up" categories.
The classification of pictures was relatively straightforward:
The feed is not censored coming in, as far as we know. It includes bestiality, incest, tasteless, and pedophilia feeds, but these were sufficiently few not to be detected by the sample.
We cannot derive figures for the percentage of overall posts that are of a sexual nature, since our selection criterion is that the postings are at least 10000 bytes in length. That said, 39% of all large postings are of a sexual nature (after compensating for cross-posting and/or multi-part postings).
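The 39% can be checked against the grouped table of weighted counts. Which rows were counted as sexual is an assumption here: the sketch includes every explicitly sexual category and shows the result with and without the 0.25 of sound postings that purport to be erotic.

```python
# Weighted counts from the grouped table (total 65.06).
weighted = {
    "non-sexual images": 4.36,
    "other": 32.53,
    "programs": 2.33,
    "sounds": 0.81,
    "sexually explicit stories": 3.50,
    "movies (probably hardcore)": 0.82,
    "hardcore porn": 2.83,
    "sort of hardcore porn": 2.70,
    "softporn": 13.85,
    "don't know (probably softporn)": 1.00,
    "don't know (probably softporn cartoon)": 0.33,
}
# ASSUMPTION: these are the rows counted as "of a sexual nature".
sexual_rows = [
    "sexually explicit stories",
    "movies (probably hardcore)",
    "hardcore porn",
    "sort of hardcore porn",
    "softporn",
    "don't know (probably softporn)",
    "don't know (probably softporn cartoon)",
]
EROTIC_SOUND = 0.25  # part of the "sounds" row

total = sum(weighted.values())
sexual = sum(weighted[row] for row in sexual_rows)

share_without_sound = 100 * sexual / total
share_with_sound = 100 * (sexual + EROTIC_SOUND) / total

print(round(share_without_sound, 1), round(share_with_sound, 1))
```

Only the variant that includes the erotic sound rounds to 39%, which suggests, but does not prove, that it was counted.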
We classify as pornographic everything sexual, though this is (as I understand it) not the correct way of using the term, which has a specific legal meaning. "Pornographic" implies "contact", including intercourse, or I guess something "equivalent". This would in our classification correspond to "hard core". Thus, strictly speaking only 14% of images on the USENET are "pornographic".
We assume that images are always posted in one or more pieces of at least 10000 bytes each. I have not formally verified this, except by manually checking a few hundred postings in several image-intensive groups, where there was not a single exception to this assumption.
The date of the first sample was the date stamp of the local Unix file. The date in the second sample was the date contained in the posting itself, and stored in the overview database of the InterNetNews server.
More detailed data can be found here
The list of newsgroups sampled is here
Some info on the method and scripts used is here
August 9th, 1995: created
September 11th, 1995: added pointer to newsgroup list, and scripts
January 6th, 1996: added pointer to meta-page