Counting unique domains from email address field
I would like to perform some basic checks for email addresses my users are entering. Sometimes they are simply making a typo in the domain part of the email. Typically I’m checking if MX record for given domain exists, but there is no need to check it for all of them. This is why I want to retrieve them from the database and count most used ones.
Counting unique domains in SQL Server
In the SQL server environment, we are simply searching for the occurrence of the ‘@’ sign and retrieving everything that is on the right of it. Grouping and counting is the no-brainer. Because I want to have the most used domains first, I also ordered the results.
SELECT RIGHT(email_field, LEN(email_field) - CHARINDEX('@', email_field)) domain, COUNT(email_field) domain_count FROM TableWithEmailField WHERE LEN(email_field) > 0 GROUP BY RIGHT(email_field, LEN(email_field) - CHARINDEX('@', email_field)) ORDER BY domain_count DESC
Counting unique domains in MySQL
In MySQL, the substring_index function is doing most of the job. As for SQL Server, grouping and counting are apparent; ordering is handy to have most used domains on top of the results.
SELECT SUBSTRING_INDEX( email, '@', -1 ) domain, COUNT( * ) email_count FROM `shoppers` WHERE LENGTH( email ) >0 GROUP BY SUBSTRING_INDEX( email, '@', -1 ) ORDER BY email_count DESC
Is it worth trying?
I noticed that over 60% of email addresses in the databases I have access to belong to the top 10 domains. This means 60% fewer calls to the DNS when checking for the MX record. If you want to minimize DNS calls, you can also cache your queries and skip the ones you already tested. I found that top 10 is good enough for now. Here is my list of most used email domains:
- google.com
- yahoo.com
- hotmail.com
- aol.com
- comcast.net
- att.net
- verizon.net
- msn.com
- live.com
- outlook.com
Please note that most of them contain users from the US, so if you are operating in the different area, your top 10 list may differ.