Netsweeper: We’ve now classified one billion URLs

Netsweeper tells me they have now classified one billion URLs: 

On/around June 21, 2008, the Netsweeper central database will surpass 1 billion currently categorized URLs. This is (one of) the critical differences of the Netsweeper filtering capability. And this capacity is accelerating. We have just completed our full installation at BSNL, the (largest) ISP in India. Within 18 months, they estimate an additional 6 million users will average 30 searches, and 4 new URLs categorized per user, each day. This will generate more than 24 million new URLs categorized each day. Our next billion should happen within the next 6 months.

Is this credible? The short answer is yes, depending on how you count. According to the latest figures on Wikipedia, the web in 2008 contains 100 million web sites with 63 billion pages. So if you’re counting pages as URLs, the figure is entirely credible. Netsweeper builds this database by maintaining a central database “in the cloud” that customers access, and by using AI to classify previously unclassified URLs and add them automatically, as described on their website:

To solve the traditional problems with purely list-based filters, Netsweeper developed a dynamic new approach. This approach uses a central database of categorized URLs. Each Netsweeper Policy server contains only the URLs accessed by the local users (students, patrons, employees, home users). If a user visits a site that isn’t in the local database, it is requested from the central database (CNS). The CNS provides the information about the requested site to the Policy server, which caches the information so as to be ready should the request be made again. This ensures that the Policy server only has relevant URLs in its cache.
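The lookup flow described in that passage is essentially a read-through cache. Here is a minimal sketch of the idea in Python; the class and method names (`CentralDatabase`, `PolicyServer`, `categorize`) are hypothetical and are not Netsweeper's actual API. It also leaves out the AI classification of unknown URLs, so an unclassified URL simply comes back as `None`.

```python
from typing import Dict, Optional


class CentralDatabase:
    """Hypothetical stand-in for the central database (CNS)."""

    def __init__(self, categories: Dict[str, str]) -> None:
        self._categories = categories
        self.lookups = 0  # counts how often the central database is consulted

    def lookup(self, url: str) -> Optional[str]:
        self.lookups += 1
        return self._categories.get(url)


class PolicyServer:
    """Hypothetical local server that caches only URLs its own users request."""

    def __init__(self, central: CentralDatabase) -> None:
        self._central = central
        self._cache: Dict[str, str] = {}

    def categorize(self, url: str) -> Optional[str]:
        # Serve from the local cache when the URL has been seen before.
        if url in self._cache:
            return self._cache[url]
        # Otherwise ask the central database and remember the answer.
        category = self._central.lookup(url)
        if category is not None:
            self._cache[url] = category
        return category


cns = CentralDatabase({"http://example.com/": "reference"})
server = PolicyServer(cns)
first = server.categorize("http://example.com/")   # goes to the central database
second = server.categorize("http://example.com/")  # answered from the local cache
unknown = server.categorize("http://example.org/unlisted")
```

In this sketch the second request for the same URL never reaches the central database, which is the property the quoted passage emphasizes: each local server holds only the URLs its own users have actually visited.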

FYI, Netsweeper has its own database lookup service, which I’ve added to my list of now eight filter database lookup services.

3 Responses

  1. Interesting article, but it highlights a fundamental problem: relying on a URL database alone to provide effective filtering is flawed.

    A billion URLs is a significant improvement over other URL databases of 40 million URLs, but still only provides around 1.6% coverage based on the estimated number of URLs.

    That still leaves IT managers with a big challenge: do they block or allow access to requested URLs not listed in the database? Allow access and they risk underblocking, exposing their network and users to increased risk. Deny access and they risk overblocking, causing user frustration and lots of support calls to add legitimate URLs to the allowed list. A difficult choice to make.

    Indeed, yesterday’s announcement by ICANN – see http://www.pcmag.com/article2/0,1895,2321845,00.asp – simply makes the problem worse. The expected rush to add new domains will result in the URL database approach to web filtering becoming even less effective.

    Jim

  2. [...] I noticed a post on David Burt’s blog, filteringfacts.org, that NetSweeper had told him they had exceeded 1 [...]

  3. [...] million new URLs categorized each day. Our next billion should happen within the next 6 months. Netsweeper: We’ve now classified one billion URLs Filtering Facts I think BSNL is filtering websites !! Hope they block the proper ones instead of youtube and [...]
