URL Deduplication Techniques in Java, Redis, and Databases

This article reviews six practical URL deduplication methods (Java Set, Redis Set, database queries, unique indexes, Guava Bloom filter, and Redis Bloom filter), explains their principles, provides implementation code, and recommends the most suitable approach for different system scales.

Full-Stack Internet Architecture

Sep 13, 2020

URL deduplication is a common problem in daily development and interview questions at major internet companies such as Alibaba, NetEase Cloud, Youku, and Zuoyebang. This article examines six practical solutions and provides full implementations.

Deduplication Strategies

Use a Java Set and check the return value of add() (true means the URL is new).

Use a Redis Set and check the return value of SADD to determine uniqueness.

Store URLs in a relational database and query for duplicates with SQL.

Create a unique index on the URL column and rely on insertion errors to detect duplicates.

Apply Guava's Bloom filter for high‑performance, memory‑efficient deduplication.

Use Redis's Bloom filter module for distributed environments.

1. Java Set Deduplication

The Set collection guarantees element uniqueness; attempting to add a duplicate returns false. The following code demonstrates this approach:

import java.util.HashSet;
import java.util.Set;

public class URLRepeat {
    // URLs to be deduplicated
    public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        for (String url : URLS) {
            // add() returns false when the element is already in the set
            if (!seen.add(url)) {
                System.out.println("URL already exists: " + url);
            }
        }
    }
}

Running the program prints:

URL already exists: www.apigo.cn
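
HashSet is not safe for concurrent use. As a minimal sketch, assuming several threads feed URLs into one shared deduplicator (a scenario the original example does not cover), ConcurrentHashMap.newKeySet() provides a thread-safe Set whose add() follows the same contract. The class name ConcurrentUrlDeduper and the method firstSeen are hypothetical names chosen for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentUrlDeduper {
    // Thread-safe Set view backed by a ConcurrentHashMap
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a URL is offered and false for every repeat. */
    public boolean firstSeen(String url) {
        return seen.add(url);
    }

    public static void main(String[] args) {
        ConcurrentUrlDeduper deduper = new ConcurrentUrlDeduper();
        String[] urls = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};
        for (String url : urls) {
            if (!deduper.firstSeen(url)) {
                System.out.println("URL already exists: " + url);
            }
        }
    }
}
```

Because add() is atomic on this set, two threads offering the same URL at the same moment cannot both be told it is new.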

2. Redis Set Deduplication

Redis Set works similarly to Java's Set. In redis-cli, SADD returns the number of members actually added, so adding a single URL returns 1 when it is new and 0 when it is already in the set.

In a Spring Boot project, the code below uses RedisTemplate to perform the same check:

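import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UrlRepeatController { // controller name is illustrative
    private static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    // StringRedisTemplate is the String-typed RedisTemplate that Spring Boot auto-configures
    @Autowired
    StringRedisTemplate redisTemplate;

    @RequestMapping("/url")
    public void urlRepeat() {
        for (String url : URLS) {
            // SADD returns the number of members added; 0 means the URL was already in the set
            Long added = redisTemplate.opsForSet().add("urlrepeat", url);
            if (added != null && added == 0) {
                System.out.println("URL already exists: " + url);
            }
        }
    }
}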

To use RedisTemplate, add the dependency:

<!-- Spring Data Redis starter, which provides RedisTemplate -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>

and configure the connection in application.properties:

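# Spring Boot 2.x property names; Spring Boot 3.x uses the spring.data.redis.* prefix
spring.redis.host=127.0.0.1
spring.redis.port=6379
# Uncomment the next line if a password is required
#spring.redis.password=123456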

3. Database Deduplication

A relational table can store URLs. The table definition and indexes are shown below:

/* Table: urlinfo */
create table urlinfo (
    id int not null auto_increment,
    url varchar(1000),
    ctime date,
    del boolean,
    primary key (id)
);

/* Index: Index_url */
create index Index_url on urlinfo (url);

Inserting URLs and querying with SELECT COUNT(*) FROM urlinfo WHERE url = ? reveals whether a URL already exists (count > 0).

4. Unique Index Deduplication

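Creating a unique index on the url column makes the database itself reject duplicate inserts: a second INSERT of the same URL fails with a duplicate-key error, which the application catches and treats as a duplicate. Unlike the query-then-insert approach, the check and the write happen in a single statement. If the non-unique Index_url from the previous section already exists, drop it first, because the index name is reused: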

create unique index Index_url on urlinfo (url);
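
The two database strategies (querying first, or relying on the unique index) can be sketched with plain JDBC. This is a hedged illustration rather than the author's code: the class name UrlInfoDao and its method names are hypothetical, the caller is assumed to supply an open Connection to a database containing the urlinfo table, and treating every SQLSTATE class 23 error as a duplicate is a simplification, since that class also covers other integrity violations.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public class UrlInfoDao {
    static final String COUNT_SQL = "SELECT COUNT(*) FROM urlinfo WHERE url = ?";
    static final String INSERT_SQL = "INSERT INTO urlinfo (url) VALUES (?)";

    /** Query-based check: returns true if the URL is already stored. */
    public static boolean exists(Connection conn, String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(COUNT_SQL)) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getLong(1) > 0;
            }
        }
    }

    /**
     * Unique-index check: tries the insert and reports false when the
     * database rejects it as a duplicate. Requires the unique index on url.
     */
    public static boolean insertIfAbsent(Connection conn, String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, url);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            if (isConstraintViolation(e)) {
                return false; // simplification: assumed to be the duplicate-key case
            }
            throw e;
        }
    }

    /** SQLSTATE class 23 is the standard class for integrity constraint violations. */
    public static boolean isConstraintViolation(SQLException e) {
        String state = e.getSQLState();
        return e instanceof SQLIntegrityConstraintViolationException
                || (state != null && state.startsWith("23"));
    }
}
```

Note that calling exists() and then inserting is not atomic: two clients can both see a count of 0 and both insert. The unique index closes that gap, which is why insertIfAbsent() does not query first.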

5. Guava Bloom Filter Deduplication

Bloom filters offer space‑efficient probabilistic membership testing with a configurable false‑positive rate. Using Guava, the implementation is:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class URLRepeat {
    public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    public static void main(String[] args) {
        // A fixed charset keeps hashing consistent across machines
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10,    // expected insertions
                0.01); // false-positive probability
        for (String url : URLS) {
            // mightContain() can return a false positive, but never a false negative
            if (filter.mightContain(url)) {
                System.out.println("URL already exists: " + url);
            } else {
                filter.put(url);
            }
        }
    }
}

Running the program prints the duplicate URL.
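
To make the principle concrete, here is a minimal from-scratch sketch of a Bloom filter built on java.util.BitSet. It is not Guava's implementation: the class name SimpleBloomFilter, the use of String.hashCode() and FNV-1a as the two base hashes, and the double-hashing scheme are all choices made for illustration. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions, and assumes expectedInsertions is greater than zero.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;   // m
    private final int numHashes; // k

    public SimpleBloomFilter(int expectedInsertions, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // m = -n * ln(p) / (ln 2)^2
        this.numBits = Math.max(1,
                (int) Math.ceil(-expectedInsertions * Math.log(falsePositiveRate) / (ln2 * ln2)));
        // k = (m / n) * ln 2
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedInsertions * ln2));
        this.bits = new BitSet(numBits);
    }

    /** Sets the k bits for the value; returns true if any bit changed (value was definitely new). */
    public boolean put(String value) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        boolean changed = false;
        for (int i = 0; i < numHashes; i++) {
            int index = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(index)) {
                bits.set(index);
                changed = true;
            }
        }
        return changed;
    }

    /** Returns false if the value was definitely never added, true if it probably was. */
    public boolean mightContain(String value) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second base hash
    private static int fnv1a(String value) {
        int hash = 0x811c9dc5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(10, 0.01);
        for (String url : new String[] {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"}) {
            if (filter.mightContain(url)) {
                System.out.println("URL already exists: " + url);
            } else {
                filter.put(url);
            }
        }
    }
}
```

The filter stores only bits, never the URLs themselves, which is where the memory savings come from. The cost is that a URL that was never added can occasionally be reported as present, so a Bloom filter suits workloads that can tolerate skipping a small fraction of new URLs.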

6. Redis Bloom Filter Deduplication

Bloom filters in Redis are provided by the RedisBloom module, which can be loaded into Redis 4.0 or later and adds the bf.* commands. After enabling the module (e.g., via Docker), the following Java code uses Jedis to call bf.add and bf.exists through Lua scripts:

import redis.clients.jedis.Jedis;
import java.util.Collections;

public class BloomExample {
    private static final String KEY = "URLREPEAT_KEY";
    public static final String[] URLS = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};

    public static void main(String[] args) {
        // Host and port are assumptions; the original obtained the connection
        // from a helper class (utils.JedisUtils) that is not shown
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            for (String url : URLS) {
                if (bfExists(jedis, KEY, url)) {
                    System.out.println("URL already exists: " + url);
                } else {
                    bfAdd(jedis, KEY, url);
                }
            }
        }
    }

    /** Add element; returns true if it was newly added */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        // The filter name is passed as a key (KEYS[1]) and the URL as an argument (ARGV[1])
        String luaStr = "return redis.call('bf.add', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr,
                Collections.singletonList(key), Collections.singletonList(value));
        return Long.valueOf(1L).equals(result);
    }

    /** Check existence; true means the element was probably added before */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr,
                Collections.singletonList(key), Collections.singletonList(value));
        return Long.valueOf(1L).equals(result);
    }
}

The console output again shows the duplicate URL.

Conclusion

The article presents six URL deduplication solutions. Among them, Redis Set, Redis Bloom filter, database queries, and unique indexes are suitable for distributed systems; for massive distributed workloads, the Redis Bloom filter is recommended, while for single‑machine large datasets, Guava's Bloom filter offers an efficient alternative.
