please help me. I am stuck

asked Apr 25, 2022 by Paul Hill (300 points)

I am trying to create a simple implementation of the FlajoletMartin algorithm using Python. The stream will be the contents of a text file and you will produce an approximation of the number of unique words in the file as given by the algorithm. You will need to process the file one line at a time and may not store any part of the file. You can obtain words by splitting the lines on whitespace. Your code will be run from a terminal according to the following command

The text file is:

HERE IS THE TEXT FILE:

this is a fun file
this is the second line of the file
this is the third line of the file
this is the fourth and final line of the file

import sys

for line in sys.stdin:
words = line.split()
for word in words:
bin_string = bin(hash(word))
print(bin_string)

2 Answers

answered Apr 25, 2022 by Eveheeero (140 points)

answered Apr 26, 2022 by Peter Minarik (101,420 points)

I tried that. And I receive a syntax error. Can you show me?

The below code would "work".

import sys

strings = set()

for line in sys.stdin:
    words = line.split()
    for word in words:
        strings.add(word)

print("\nNumber of distinct elements:", len(strings))

By "work" I mean the code runs and calculates the exact number of distinct elements in the input.

However, this code definitely does not implement the Flajolet-Martin algorithm.

commented Apr 26, 2022 by Paul Hill (300 points)

Does this run in the terminal according to the following command

commented Apr 26, 2022 by Peter Minarik (101,420 points)
edited Apr 26, 2022 by Peter Minarik

It runs in OnlineGDB just fine. Make sure on the bottom of the page you select INPUT --> TEXT (not Interactive Console) so you can paste your file's content there.

It also should run just fine on any computer if you redirect your files input into your executable.

e.g.:

./WordCount.py < MyInput.txt

I hope this helps.

commented Apr 26, 2022 by Paul Hill (300 points)

Is the any way you can try to do the Flajolet algorithm with it.    I can do it with a list of number just fine.

stream=[1,2,3,4,5,6,4,2,5,9,1,6,3,7,1,2,2,4,2,1]

maxnum=0
for i in range(0,len(stream)):
    val= bin((1*stream[i] + 6) % 32)[2:]

    sum=0
    for j in range(len(val)-1,0,-1):

        if val[j]=='0':
            sum+=1
        else:
            break
    if sum>maxnum:
        maxnum=sum

print('distict elements', 2**maxnum)

is there any way to modify both codes so it is more like the algorithm

commented Apr 28, 2022 by Peter Minarik (101,420 points)

Isn't it as simple as you tried in your original code using the hash() function? Get the hash of a word and use that, instead of stream[i].

please help me. I am stuck

Please log in or register to answer this question.

2 Answers

Please log in or register to add a comment.

Please log in or register to add a comment.