
The FBI reviewed all of the 650,000 emails from the laptop belonging to ex-Congressman Anthony Weiner. Read about how they were able to finish the process so quickly.


FBI Director James Comey aroused suspicion late on the Sunday before Election Day when he announced that the agency had quickly reviewed a reported store of 650,000 emails recovered from a laptop belonging to disgraced ex-Congressman Anthony Weiner (husband of top Hillary Clinton aide Huma Abedin) and reconfirmed that Clinton would face no criminal charges. Donald Trump and his supporters were suspicious: how would it be possible to review so many emails in such a short time?

After the New York Times, Wired and even Edward Snowden confirmed that deduplication was at the core of the rapid review, we spoke with John Kosturos of RingLead to get more details on the data science behind the speedy review.

How could the FBI examine 650,000 emails and be sure that these are the same ones that they already had?

Kosturos: You’re talking about finding patterns in subject lines, email addresses, and the body text of an email. What we do to identify duplicates in a database is run algorithms on field values to see if they’re identical or very similar, and this is much the same problem. If you have access to the data in the emails, where the subject is in the same place every time, the email address is available, and the body of the email is available, then you could run basic pattern-matching algorithms on those emails en masse and do a real-time crawl through that inbox. Once you have the data, you can process it quickly because it’s just a series of checks: is the subject exactly the same? Is it the same email address? Are the first 10 words the same? Running that kind of technology on the database, it probably only takes a few hours to flag any emails that were duplicates, based on the script that was written.
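As a rough illustration of the checks Kosturos lists (same subject, same address, same first 10 words), here is a minimal Python sketch. It is not the FBI's or RingLead's actual method; the field names `sender`, `subject` and `body` and both function names are assumptions made for the example.

```python
import hashlib

def dedup_key(email: dict) -> str:
    """Fingerprint an email by sender, subject and the first 10 words of its body.

    The dict keys used here are hypothetical; real mail stores differ.
    """
    first_words = " ".join(email["body"].split()[:10])
    # Join with a separator that is unlikely to appear in the fields themselves.
    raw = "\x1f".join([email["sender"], email["subject"], first_words])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def split_known_and_new(known: list, candidates: list) -> tuple:
    """Separate candidates into those already seen and those that are new."""
    seen = {dedup_key(e) for e in known}
    duplicates, new = [], []
    for e in candidates:
        (duplicates if dedup_key(e) in seen else new).append(e)
    return duplicates, new
```

Because each email is reduced to one hash and looked up in a set, the work grows roughly in proportion to the number of emails, which is why a collection of this size is a matter of hours rather than weeks.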

If the data is not in such a consistent format, does that add a wrinkle to the process? If they came from Hillary Clinton’s server, they may have been downloaded by a different email client. Some were recovered, not from Hillary’s server, but from the people that she was emailing with.

Data can be in all sorts of formats. If you think about a company name, it can be spelled 20 or 30 different ways. If we’re trying to identify companies that are duplicates in a database, before we actually even do the lookup, we standardize the values and we’ve got some patented technology that will basically get rid of heading and move the words here, there.

The goal is to identify the true convention for that particular piece of data. If you’re talking about identifying emails that are similar, there may be some formatting that’s different, or it may be in a different type of file. But once you can break it down into, “hey, this piece is what the subject would be,” and strip out any stray characters at the beginning or the end, you can process that for the pattern matching. There is a level of standardization or normalization that you would want to apply prior to searching, because the matching wouldn’t be able to pick up emails in different formats if you didn’t strip out what was not uniform.
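The kind of normalization described here might look like the following Python sketch. The specific rules (dropping "Re:" and "Fwd:" prefixes, lowercasing, collapsing whitespace) are assumptions chosen for illustration, not rules taken from the interview, and the function names are hypothetical.

```python
import re
from email.utils import parseaddr

# Leading reply/forward markers such as "RE: Fwd: " (an assumed rule set).
_REPLY_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes, collapse whitespace and lowercase."""
    stripped = _REPLY_PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()

def normalize_address(address: str) -> str:
    """Reduce 'Display Name <user@host>' to a bare lowercase address."""
    _display_name, addr = parseaddr(address)
    return addr.lower()
```

Applying these before any comparison means that two copies of the same message saved by different email clients can still produce the same match key.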

Would that be a time consuming process if you’re talking about potentially 650,000 emails?

It just depends on how many different patterns you’re normalizing against. If you had four or five different formats, then you only have to create four or five different formats in your language. If you have millions of variations, you’d have to create millions of variations. But for hers, assuming there are only three, four or five different formats that the data could be in, it wouldn’t take long to create that normalization structure. Then, once that’s finished, running the script against the emails? Two, three, four hours, tops, is what I would say it takes to actually execute the job.
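One way to picture "four or five different formats" is a small table with one parser per format, each producing the same common record. The Python sketch below assumes two hypothetical formats, a raw message with standard headers and a JSON export with made-up keys (`from`, `title`, `text`); neither is known to match the data the FBI actually handled.

```python
import json
from email import message_from_string

def parse_rfc822(raw: str) -> dict:
    """Parse a raw message that has standard From/Subject headers."""
    msg = message_from_string(raw)
    return {"sender": msg["From"], "subject": msg["Subject"], "body": msg.get_payload()}

def parse_json_export(raw: str) -> dict:
    """Parse a hypothetical JSON export; the key names are assumptions."""
    data = json.loads(raw)
    return {"sender": data["from"], "subject": data["title"], "body": data["text"]}

# One entry per known source format; adding a format means adding one parser.
PARSERS = {"rfc822": parse_rfc822, "json_export": parse_json_export}

def to_common_record(source_format: str, raw: str) -> dict:
    """Convert a raw email in a known format into the common record shape."""
    return PARSERS[source_format](raw)
```

The effort scales with the number of formats rather than the number of emails, which is the point made in the answer above.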

So when you hear that the FBI reviewed that volume of emails over the course of a week and came to this conclusion, that doesn’t raise any red flags in your mind or make you suspicious about whether that was a practical task for them to do?

No, not at all. I think that they can do it. The level of computing power and the data scientists that they have, they would be able to do that pretty quickly.

Would it surprise you that that volume of emails would turn out to all be perfectly matched? That they would come back in a week and say, “Really, there’s just nothing different. This is all clear.”

If it’s somebody else’s computer, they might not all be the same, but they could be parts of email streams where somebody gets added to the stream. Possibly a piece of the message is similar but not the whole thing. I would say there’s probably some overlap from when the person was added to an email chain or forwarded a message.
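For partial overlap of this kind, exact matching is not enough. One common approach, swapped in here purely as an illustration and not described in the interview, is to compare overlapping runs of words ("shingles") and measure how much of a known message is contained in a longer one. The function names and the run length of three words are assumptions.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word runs of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def containment(known_body: str, candidate_body: str, size: int = 3) -> float:
    """Fraction of the known message's shingles found in the candidate."""
    known = shingles(known_body, size)
    if not known:
        return 0.0  # too short to compare at this shingle size
    return len(known & shingles(candidate_body, size)) / len(known)
```

A score near 1.0 suggests the candidate quotes or forwards the known message in full, while a score near 0.0 suggests unrelated text that would need a closer look.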

Could there be uniformity if the machine that they’re looking at was just perfectly synchronizing with another database, i.e., email on another server?

Highly possible. With cloud computing, you can pretty much share a profile on any machine. I have people that are my assistants, they’ll log into my email so they can send emails for me and they’re on a different machine. With cloud computing, it wouldn’t be too far-fetched.

Does anything in this scenario tap into the kind of work that you do at RingLead? Are there any analogies to the sales and marketing universe?

For us, it’s educating all sales and marketing people that time is talent and energy is their money. By investing a small amount of time working on data (deduping it, normalizing it, giving it a format that they can actually digest and utilize), they’ll have more time to actually go out and work on that information. What RingLead does is help solve highly complex data challenges, which requires a high level of expertise. If you don’t have that, you could spend hundreds or thousands of hours on a task that could be automated into just a few hours.

This FBI story is just a good example of a highly complex data challenge. If you went through and manually looked up all those emails, it would take you hundreds of hours, but if you work with a company like RingLead that has a background in data science and matching data, we can get it to a more digestible state in a much faster time.

