/    Sign up×
Community /Pin to ProfileBookmark

random row from a large database

I have a table with for example the following columns:
– id
– title
– body
– type (image, swf, text)
– posted_on

Lets say there are over 3 million rows in this table.
I need to get 1 random row from within this 3 million.

Obviously using the ORDER BY RAND() is not a good idea and is extremely slow.

Another way would be getting the max_id then doing a random from 1 to max_id (call it $random_num) and then getting a query which states give me item >= $random_num LIMIT 1

This may seem to work and will work great if there were no id’s deleted.

It becomes very tricky now that I add a WHERE clause stating (for example) WHERE `type` = ‘text’; all of a sudden this method becomes unfair and chances of some numbers coming up more becomes much higher.

Let me explain (id’s below):
1 = text
2 = text
3 = image
4 = flash
5 = image
6 = text
7 = text

In the above case id #1 has a 1:7 chance to get selected while id #6 has a 4:7 chance since items 3,4, and 5 if selected would also cause #6 to be picked.

Basically I need to know what is the appropriate method of getting a random id in cases that I will be filtering using a WHERE statement.

Thanks.

to post a comment
PHP

33 Comments(s)

Copy linkTweet thisAlerts:
@svidgenJun 30.2011 — At first glance, I'd say you either need to accept the disproportionate odds as a necessary side effect of the large dataset, or you need to create indexes that specifically deal with the issue: either an index for each of text, image, and flash or an index that has a completely random component in itself.

So, add a random_index column and a key like so:

index type_randomindex (type, random_index)[/QUOTE]

You'll still end up with cases that are somewhat unfair, but it should be lessened by the [I]layers[/I] of randomness. That is, there will still be gaps that unfairly weight certain rows, but the gaps will have been generated randomly, and can at any time be regenerated.

Know what I mean?
Copy linkTweet thisAlerts:
@svidgenJun 30.2011 — An alternative, which I can't promise will be even remotely efficient, would be to first ensure that you have your filtering column indexed (type), and then select a particular random row from the full set of results.

So, generate a count (possibly using a fuzzy counting method) of rows where type=[type]. Generate a random number: 1 <= N <= count. Then simply select row N: [I]select * from T where type=[type] limit N,1[/I].

The assumption here is that your DMBS is smart enough to quickly find row N using the index on type. This may be a false assumption. But, I'd say it's worth benchmarking. And if you take the time to benchmark it, we'd be appreciative if you could share your findings.
Copy linkTweet thisAlerts:
@NogDogJun 30.2011 — You might find this thread from the PHPBuilder sister forum useful: http://phpbuilder.com/board/showthread.php?t=10338930
Copy linkTweet thisAlerts:
@XeroSiSauthorJun 30.2011 — excellent information guys, I will review them tomorrow and update you on how I finally approach this.
Copy linkTweet thisAlerts:
@eval_BadCode_Jul 01.2011 — 
Lets say there are over 3 million rows in this table.

I need to get 1 random row from within this 3 million.

Obviously using the ORDER BY RAND() is not a good idea and is extremely slow.
[/QUOTE]


[code=php]

$randomRow = mysql_fetch_assoc(
mysql_query(
sprintf(
"SELECT * FROM User LIMIT &#37;d,1",
rand(1,implode(mysql_fetch_row(mysql_query("SELECT COUNT(User_ID) FROM User"))))
)));

var_dump($randomRow);

[/code]


The cardinality of the PK is the reason for using it inside of the COUNT(PK) query. It will reflect the number of rows that exist in your table without the need for MySQL to look at every record or analyze the table.

Edit: This may seem to work and will work great if there were no id's deleted. [/QUOTE]

It does work. Doesn't matter if your ID's are random themselves or if huge gaps exist. Although it is possible to have an record get deleted between these two queries- in which case you can move the random part into a subquery so it will be run at the same time.

Basically I need to know what is the appropriate method of getting a random id in cases that I will be filtering using a WHERE statement.[/QUOTE]

Put the WHERE clause in both queries and only select the ID.

Cheers
Copy linkTweet thisAlerts:
@XeroSiSauthorJul 12.2011 — So are you basically saying:

Make a SELECT query with all the filters and only get ID's, place them in an array, random one array and get the info from there?

Basically make an index using array's?
Copy linkTweet thisAlerts:
@Michael_GJul 12.2011 — One possible solution for this problem is to divide your table rows into smaller blocks and then perform the random query between the low and high index of each block of rows.

Use a SELECT MAX(id) query to get the maximum number of rows of data in the table, then divide that number by another number such as 100, this give you the size of each block in the number of indexes, but does it within the PHP scope mostly, instead of in the database scope.

Then use PHP code to pick a random number between 1 and 100, multiply that number by the number of rows in each block of rows to get your starting index, and also get the ending index.

Then you query within that block of rows instead of within the total count of millions of rows, using the random function to get the random value. In this way you are dealing with a smaller block of data rows and the query will run faster.

Number of Rows per block = Total Number of Rows / 100

Round up the number if a remainder.

randomnumber = a random number between 1 and 100 using PHP

start index for block = randomnumber * Number of Rows Per Block.

end index for block = start index for block + Number of Rows Per Block.

SELECT *

FROM Table

WHERE id > start index for block

AND id < end index for block

AND other where condition

ORDER BY STATEMENT

You can try different combinations of numbers to get smaller blocks, such as using 1000 instead of 100, and so forth until a good solution is found. This is a good way to optimize your database SQL and run the queries faster for your PHP web application.

Michael G. Workman

[email][email protected][/email]
Copy linkTweet thisAlerts:
@svidgenJul 12.2011 — [B]@Michael_G[/B]

An interesting suggestion. But, there's no advantage over the previous (highly efficient) solution of picking a random number N such that min() N < max() and selecting the first row >= N. The problem, as the OP indicated, is the uneven probability caused by gaps in the sequence.
Copy linkTweet thisAlerts:
@eval_BadCode_Jul 12.2011 — So are you basically saying:

Make a SELECT query with all the filters and only get ID's, place them in an array, random one array and get the info from there?

Basically make an index using array's?[/QUOTE]


Not at all... I'm saying select the number of rows. Then select a random one using PHP's rand function and placing that random number into the LIMIT clause, so you only receive one result. If you making an index using arrays you're going to get a horribly inefficient result. I think mine is rather efficient.

In english its like saying:

Select the number of rows

create a random number between 0 and that number

select 1 row, starting at that random number (who cares how it's ordered)
Copy linkTweet thisAlerts:
@WonDer9Jul 13.2011 — Random numbers aren't actually random different numbers. They are numbers chosen not in order, but they don't have to be different. Basically numbers chosen by probability, numbers that are chosen differently all the time are actually chosen by a pattern, this is what algorithms are for. they create different situations chosen by set(consistent) parameters. so random isn't really random. just out of order.

to create an actual random(different all the time) you have to create a pattern to differentiate between the numbers. basically variables that change to create a constant difference. something like below will guarantee a different number all the time. you might have to tweak to create the difference you want.

$random = rand(1,3000000)

$random2 = rand(2,3000000) the 2 start creates a consistent difference.

($random * 456/$random2) = always a completely random number. But as I was stating you basically have a pattern in the equation. Funny how math works. if I knew php better I would be able to explain how to use this but I'm only a beginner.

the funny thing you will also notice about real random-ness is that most people don't really want it. we all like a little consistency even in our coding. You should really focus on what you want then try to create that situation. Random is kind of a hoax and you'll get tired of it real soon....you'll see.
Copy linkTweet thisAlerts:
@svidgenJul 13.2011 — [B]@WonDer9[/B]

Missing the point. Also, it seems like you're blending the lines between randomness and chaos. Randomness isn't about "always getting a different number." It's about ensuring that [B]each member of a set has an equal chance of being picked[/B] at each selection. And that could very well mean 3 is chosen 3 times in a row and never seen again for the life of the application.

It gets more philosophical than that, as you alluded. But, taking the OP's concerns into mind, it's important to note that the OP does want a [B]random[/B] row. That is, [B]each row must should have an equal chance of being selected at every selection[/B]. Now, we certainly can't do this sort of thing perfectly, but, if feasible, we can try to avoid the obvious factors that skew even probability distribution (like gaps in the sequence).
Copy linkTweet thisAlerts:
@eval_BadCode_Jul 13.2011 — Random numbers aren't actually random different numbers. They are numbers chosen not in order, but they don't have to be different. Basically numbers chosen by probability, numbers that are chosen differently all the time are actually chosen by a pattern, this is what algorithms are for. they create different situations chosen by set(consistent) parameters. so random isn't really random. just out of order.

to create an actual random(different all the time) you have to create a pattern to differentiate between the numbers. basically variables that change to create a constant difference. something like below will guarantee a different number all the time. you might have to tweak to create the difference you want.

$random = rand(1,3000000)

$random2 = rand(2,3000000) the 2 start creates a consistent difference.

($random * 456/$random2) = always a completely random number. But as I was stating you basically have a pattern in the equation. Funny how math works. if I knew php better I would be able to explain how to use this but I'm only a beginner.

the funny thing you will also notice about real random-ness is that most people don't really want it. we all like a little consistency even in our coding. You should really focus on what you want then try to create that situation. Random is kind of a hoax and you'll get tired of it real soon....you'll see.[/QUOTE]


What a random thing to say..
Copy linkTweet thisAlerts:
@svidgenJul 13.2011 — What a random thing to say..[/QUOTE]

So, there was an equal probability that he would have said anything else from the set of all things he can say?
Copy linkTweet thisAlerts:
@NoasITJul 13.2011 — So, there was an equal probability that he would have said anything else from the set of all things he can say?[/QUOTE]

yeah that comment happened to stored in row 27,918.
Copy linkTweet thisAlerts:
@svidgenJul 13.2011 — yeah that comment happened to stored in row 27,918.[/QUOTE]

Perhaps a gap in the sequence gave that row a higher probability of being selected.

I digress ...


[B]@XeroSiS[/B]

Have you found a solution that suits your needs?
Copy linkTweet thisAlerts:
@WonDer9Jul 14.2011 — Ok what ever. I was just adding to eval(badcode) solution of finding a number.

My interpretation of his problem was that he wanted his database to be chosen more evenly.

Eval(badcode) solution seemed good and I was giving another mathematical equation to select from the database. And I was stating that Randomness will chose to show 5 ten times over choosing 1,4, or 6 out of a group. Even when there are 3 million numbers.

Chaos as you corrected me will allow for his whole database to be chosen a long time before repeating again, but still with in the set amount and still allowing for some repetition. Giving all in his DB an close to equal chance. Eliminating what ever issues you think there are mathematically or philosophically.

1) Chaos is basically a better form of random. If you want all to have a chance.

2) Random is set selection, not probability. it will chose what it wants. Not chose within what it can. That is what chaos is. Random will chose to show 5 out of ten in ten tries if it so chooses too. Chaos will chose each number in a more proportionate manner and not by selection, but with in its range.

So if he wanted to have each row shown and with as little repetition as possible he would chose chaos or an simple equation to force numbers that vary widely as in my example.

That was all I was tryin to say fellas.
Copy linkTweet thisAlerts:
@svidgenJul 14.2011 — ...

Chaos as you corrected me will allow for his whole database to be chosen a long time before repeating again, but still with in the set amount and still allowing for some repetition. Giving all in his DB an close to equal chance. Eliminating what ever issues you think there are mathematically or philosophically.

1) Chaos is basically a better form of random. If you want all to have a chance.

2) Random is set selection, not probability. it will chose what it wants. Not chose within what it can. That is what chaos is. Random will chose to show 5 out of ten in ten tries if it so chooses too. Chaos will chose each number in a more proportionate manner and not by selection, but with in its range.

So if he wanted to have each row shown and with as little repetition as possible he would chose chaos or an simple equation to force numbers that vary widely as in my example.

...[/QUOTE]


Well no. That's effectively the opposite of what I said. Randomness is defined has giving each member of a set an equal chance of being selected. That means two things:

[LIST=1]
  • [*]At every selection, element X has an equal chance of being chosen as every other element. So, you may very well see element X N times in a row. But, that's both OK and GOOD.

  • [*]Over a "large number" of selections, you'll see that each element has been selected roughly the same number of times--even distribution. And this is precisely the aim of "randomness" in computer science: a selection pattern that is unpredictable at each selection, but very predictable over the course of [B]many[/B] selections.

  • [/LIST]


    Chaos, on the other hand, is a very deterministic system wherein the complexities of the system allow subtle changes in the environment to vastly change the selection distribution. Chaos does not evenly distribute selection probability. And this is almost always NOT what computer scientists are looking for.

    But again, it's not terribly relevant to the OP. The OP wants to be able to select a [B]random[/B] row from a database--not a ... "chaotically selected" row. He wishes to have each selection routine give an equal chance to every row.

    The most efficient way to do this in a database where rows are sequentially numbered without gaps is to simply pick a random number between min(row_id) and max(row_id) and select the row matching the ID. In a database with gaps, if you're [B]not (edit)[/B] concerned with letting some "chaos" into the system, you simply change your query to select the row where the ID is greater than or equal to your random number. However, it's no longer "random" -- because you now have rows with a [B]much[/B] higher probability of being selected.
    Copy linkTweet thisAlerts:
    @WonDer9Jul 14.2011 — Xerosis:

    "It becomes very tricky now that I add a WHERE clause stating (for example) WHERE type = 'text'; all of a sudden this method becomes unfair and chances of some numbers coming up more becomes much higher."[/Quote]


    1) Random has nothing to do with equal distribution or chance that is probability. Random selects in an unordered manner, this is actually happening, But it selects with repetition, so even if you ran an example from 1-3million numbers there would be selected numbers repeated in an order, which limits selection.

    He wants each row to be shown equally, chaos or True randomness will accomplish this.

    2)True Randomness which is chaos means selecting without repetition but in an unordered manner. Of course when the range is met it will begin again, which in fact is repetition.

    So in chaos form anywhere within the range a number will be selected, in randomness the numbers will be selected and repeated. That is what I assume he doesn't want. But I guess PHP runs differently.

    Whatever solution works for him is great thats what matters. And the equation was to help him select form his db more evenly with out repeating.
    Copy linkTweet thisAlerts:
    @svidgenJul 14.2011 — 1) Random has nothing to do with equal distribution or chance that is probability. Random selects in an unordered manner, this is actually happening, But it selects with repetition, so even if you ran an example from 1-3million numbers there would be selected numbers repeated in an order, which limits selection.

    He wants each row to be shown equally, chaos or True randomness will accomplish this.

    2)True Randomness which is chaos means selecting without repetition but in an unordered manner. Of course when the range is met it will begin again, which in fact is repetition.

    So in chaos form anywhere within the range a number will be selected, in randomness the numbers will be selected and repeated. That is what I assume he doesn't want. But I guess PHP runs differently.

    Whatever solution works for him is great thats what matters. And the equation was to help him select form his db more evenly with out repeating.[/QUOTE]


    I very much dislike being mean, but ... stop acting like an idiot. We all know what the OP wants. He wants each row to have an equal chance for selection. A concept, by the way, they call "randomness" in CS courses ...

    [B]Stop (edit)[/B] crusading for the appropriate definition of terms and the distinction between the essence of "true randomness" versus "true chaos"--which, by the way, you've gotten completely backwards.
    Copy linkTweet thisAlerts:
    @WonDer9Jul 14.2011 — Calm down, whatever case close. I just though it would be a good option n selecting from his db. Its still up to him anyway.

    And stop saying randomness has something to do with chance. that is probability. randomness has nothing to do with figuring out what the chance of something happening is.

    what ever I hope the solution works. is all.
    Copy linkTweet thisAlerts:
    @WonDer9Jul 14.2011 — Yeah your right it only took me to the second line to prove my point:
    Random sampling can also refer to taking a number of independent observations from the same [B]probability distribution[/B], without involving any real population.[/quote]

    I've been using these as my reference points:

    http://en.wikipedia.org/wiki/Randomness

    http://en.wikipedia.org/wiki/Chaos_theory

    http://en.wikipedia.org/wiki/Probability

    I'm not trying to argue, was just trying to help is all. Felt I could be of service. Should of just minded my own business.
    Copy linkTweet thisAlerts:
    @svidgenJul 14.2011 — And stop saying randomness has something to do with chance. that is probability.[/QUOTE]

    ...

    Yeah your right it only took me to the second line to prove my point:

    Quote:

    Random sampling can also refer to taking a number of independent observations from the same probability distribution[/QUOTE]


    ... anyone else find this ironic?
    Copy linkTweet thisAlerts:
    @svidgenJul 14.2011 — PS ... Did you also happen to read past the line the "proved your point" ... ?

    ... A simple random sample is selected so that all samples of the same size have an equal chance of being selected from the population.

    A self-weighting sample, also known as an EPSEM (Equal Probability of Selection Method) sample, is one in which every individual, or object, in the population of interest has an equal opportunity of being selected for the sample. Simple random samples are self-weighting.

    ... etc.[/QUOTE]
    Copy linkTweet thisAlerts:
    @eval_BadCode_Jul 15.2011 — PS ... Did you also happen to read past the line the "proved your point" ... ?[/QUOTE]

    Can you reset my password on monkeywar.thepointless.com
    Copy linkTweet thisAlerts:
    @svidgenJul 15.2011 — Can you reset my password on monkeywar.thepointless.com[/QUOTE]

    haha ... seriously? i didn't think there were any un-expired users left ...
    Copy linkTweet thisAlerts:
    @WonDer9Jul 15.2011 — Listen if you have the intention of helping someone then just do that but don't try to force your opinion with false or out of context information. He made this tread because he does not want a certain selection from his group to be chosen but to have his whole group to be chosen randomly(true random = chaos).

    Xerosis:

    "It becomes very tricky now that I add a WHERE clause stating (for example) WHERE type = 'text'; all of a sudden this method becomes unfair and chances of some numbers coming up more becomes much higher."[/Quote]


    Here it explains the types of random application, not randomness itself. Stop making things up. All types of randomness below that are applied to [B]control the probability[/B], not that randomness has anything to do with it . Thats why the are [B]focused[/B] meaning controlled randomness which is not randomness, but selection.

    Why can't you understand that? Its not random if its controlled.

    Which is similar to repetition which is a form of selection which is a form of control.

    To actually have a equal chance, all not those already selected but all, at anyone moment need to have an equal chance to show. Once shown your chance has been given to show again is to take it from some one else, which is not equal but selection.

    You keep assuming that once something is selected that it should have a chance to be selected again. But if it was selected already why is it still apart of the equation. That is not random that is selection. Once a selection is met others should have the same opportunity, to repeat is to actually remove that opportunity to another.

    But like chaos, randomness, and any other form of counting(which is a simple way of saying statistics). They are all limited by their quantity so repetition is inevitable. Please everyone read as to see I'm not making this up just trying to help. http://en.wikipedia.org/wiki/Random_sample


    [B]Types of random sample[/B]

    A [B]simple random sample[/B] is selected so that all samples of the same size have an equal chance of being selected from the population.

    A [B]self-weighting sample[/B], also known as an EPSEM (Equal Probability of Selection Method) sample, is one in which every individual, or object, in the population of interest has an equal opportunity of being selected for the sample. Simple random samples are self-weighting.

    [B]Stratified sampling[/B] involves selecting independent samples from a number of subpopulations, group or strata within the population. Great gains in efficiency are sometimes possible from judicious stratification.Cluster sampling involves selecting the sample units in groups. For example, a sample of telephone calls may be collected by first taking a collection of telephone lines and collecting all the calls on the sampled lines. The analysis of cluster samples must take into account the intra-cluster correlation which reflects the fact that units in the same cluster are likely to be more similar than two units picked at random.[/Quote]



  • + In statistics, a [B]simple random sample[/B] is a subset of individuals (a sample) chosen from a larger set (a population). http://en.wikipedia.org/wiki/Simple_random_sample


  • + In [B]statistics, stratified sampling[/B] is a method of sampling from a population. http://en.wikipedia.org/wiki/Stratified_sampling


  • + [B]Cluster sampling[/B] is a sampling technique used when "natural" groupings are evident in a statistical population. http://en.wikipedia.org/wiki/Cluster_sampling


  • All basically selecting a group from within a group. Which Xerosis does not want. He wants the whole population to be selected. But in reality like eval(badcode) pointed he would have to create a reference to create a number so it can select from the database. The reference would create a more random number, which I hope my equation would help to make better.

    This tread should be made resolved because no further real arguments can be mad.
    Copy linkTweet thisAlerts:
    @svidgenJul 15.2011 — Listen if you have the intention of helping someone then just do that but don't try to force your opinion with false or out of context information.[/QUOTE]

    Well, that's kinda my point too: why did you step in here to ***** about philosophy behind randomness?

    The fact that your interpretation differs from the one taught in universities is really a different matter. Since you had the curtesy to inform us of the "true" definitions for random and chaos, I figured I should correct your misunderstanding. And frankly, it's a little ironic that you're accusing me of [I]forcing my opinion[/I] on folks, because I'm really just relaying what I learned in school, in response to your crusade to change [I]our[/I] opinions.
    Copy linkTweet thisAlerts:
    @svidgenJul 15.2011 — Though, I will admit ... it's been awhile since I've been in the classroom.
    Copy linkTweet thisAlerts:
    @eval_BadCode_Jul 16.2011 — This has nothing to do with sampling... we know the entire population. No SRS- no random sample of samples, no bootstrap samples or anything like that is required.

    get it done with rand() and stop joking around ?

    You see, there was a day when you couldn't do that, and you had to use a ruler and put it onto a page with "preset random numbers" for you and go through them like that. Of course you might want to pick a new row each time. You want random 1-100000? then read the next 6 numbers, random 3-8, use modulo.

    There is no such thing as random anyways- only destiny.
    Copy linkTweet thisAlerts:
    @WonDer9Jul 16.2011 — Guys you asked me why I posted. So I figured you didn't understand my equation was to be added to eval(badcode) example. I was just trying to explain how it fit it and how it worked to create a result. didn't mean to make it sound like a discussion of life.

    But all things are relative like Einstein said, so it probably just seemed that way. Can't say I don't love a good debate.
    Copy linkTweet thisAlerts:
    @svidgenJul 16.2011 — This has nothing to do with sampling... we know the entire population.[/QUOTE]
    Well, not exactly true. It's a sample of size 1 from the population of rows.

    ... get it done with rand() and stop joking around ?[/QUOTE]
    Well yeah. That's my point. The philosophy of the matter didn't need to be mentioned. The OP knows what he wants. He accurately called it random and recognized gaps in the sequence would lead to a loss of randomness--and whether we all agree with how he's defined random, we can all (hopefully) see the heart of issue.

    There is no such thing as random anyways- only destiny.[/QUOTE]
    Enter quantum physics ... ?
    Copy linkTweet thisAlerts:
    @WonDer9Jul 20.2011 — I guess a job well done! Funny were did Xerosis go in all of this
    ×

    Success!

    Help @XeroSiS spread the word by sharing this article on Twitter...

    Tweet This
    Sign in
    Forgot password?
    Sign in with TwitchSign in with GithubCreate Account
    about: ({
    version: 0.1.9 BETA 5.3,
    whats_new: community page,
    up_next: more Davinci•003 tasks,
    coming_soon: events calendar,
    social: @webDeveloperHQ
    });

    legal: ({
    terms: of use,
    privacy: policy
    });
    changelog: (
    version: 0.1.9,
    notes: added community page

    version: 0.1.8,
    notes: added Davinci•003

    version: 0.1.7,
    notes: upvote answers to bounties

    version: 0.1.6,
    notes: article editor refresh
    )...
    recent_tips: (
    tipper: @Yussuf4331,
    tipped: article
    amount: 1000 SATS,

    tipper: @darkwebsites540,
    tipped: article
    amount: 10 SATS,

    tipper: @Samric24,
    tipped: article
    amount: 1000 SATS,
    )...