(Ars Technica)   Anonymous tracking data is only anonymous if nobody screws up the anonymizing. In other news, data on 173 million NYC cab trips are now available   (arstechnica.com)
    More: Fail, hash function, AES  

1034 clicks; posted to Geek on 24 Jun 2014 at 10:08 AM



22 Comments
 
 
2014-06-24 08:29:38 AM  
2. Cross-reference to an index of NYC escort services.
3. Profit!
 
2014-06-24 09:01:29 AM  
Eh, if it was tracking who the passengers were, it would be a big deal. But the cabbies? What benefit does a bad guy get out of that info?
 
2014-06-24 09:02:01 AM  
At least they put the effort in and it wasn't a printout left on some hooker's nightstand.
 
2014-06-24 10:15:09 AM  
I was hoping to see a map of the routes. That article was disappointing.
 
2014-06-24 10:20:30 AM  
173 million, so a week's worth?
 
2014-06-24 10:22:33 AM  
2. Cross index between investment banking houses and acquisition candidates
3. Profit

2. Cross index between bankruptcy lawyers and large corporations
3. Profit
...

There are all sorts of ways to data mine where people are going.
 
2014-06-24 10:34:02 AM  
Hmm more:

Cross index frequency of rides from particular locations - get attendance data. Estimate event revenue. More trading opportunities.

Cross index frequency of rides vs. length - get profitability for where to put your cab each hour of the day.

Cross index clients to advertising firms - estimate product rollout and ad campaign dates

Cross index companies to governmental locations - figure out who is talking with/influencing municipal government

Pick out patterns in travel for rich residences - know when they are empty.

I bet people who do it professionally could find hundreds of uses.
 
2014-06-24 10:37:15 AM  

BigSlowTarget: Hmm more:

Cross index frequency of rides from particular locations - get attendance data. Estimate event revenue. More trading opportunities.

Cross index frequency of rides vs. length - get profitability for where to put your cab each hour of the day.

Cross index clients to advertising firms - estimate product rollout and ad campaign dates

Cross index companies to governmental locations - figure out who is talking with/influencing municipal government

Pick out patterns in travel for rich residences - know when they are empty.

I bet people who do it professionally could find hundreds of uses.



You could do all this even if they'd properly anonymized the data.  The breach here was driver identity.
 
2014-06-24 10:49:40 AM  
>You could do all this even if they'd properly anonymized the data.  The breach here was driver identity.

Hm. I think I found a business opportunity. I wonder how many people are already doing it.
 
2014-06-24 11:08:13 AM  
There are a couple of government financial agencies doing "Big Data" projects right now, and this is their recurring nightmare. They really want to do the projects because it will let them make intelligent regulations based on what's ACTUALLY going on in markets. But on the other hand, it's more than a little creepy knowing a government agency just bought 5 million credit histories and is actively data-mining and tracking them over time (or that you or anyone else could buy this data too).

The compromise they've worked out is to buy the data, but then send it to a third-party contractor who will strictly "de-identify" it, stripping names, Social Security numbers, phone numbers, and ZIP codes (and other data fields that some experts have identified as potentially allowing the reverse engineering of an identity) and replacing all of that with a randomly generated ID number.

They all live in terror of someone screwing up that process and allowing "live" data through, because Congress would be able to dine out on a mistake like that for years.
 
2014-06-24 11:17:22 AM  
Bet there weren't any Sprint Points left for a security review.


//It's in the backlog on Jira, priority 9999
 
2014-06-24 11:52:22 AM  
serial_crusher:  But the cabbies? What benefit does a bad guy get out of that info?

That all depends on who one views as the bad guy. Imagine a jealous spouse of a cab driver who's been told the cabbie was out on his rounds picking up fares when he was in fact at his mistress's house. The premise implicit in your comment is that cab drivers have no privacy rights or interests to defend, which I think is nonsense regardless of what one thinks of the cab industry in general.
 
2014-06-24 11:54:35 AM  
How about NOT including those columns in the data query at all?

I have a table with columns A, B, C, D, and E. Let's say B and E are sensitive data. Why not exclude those columns from your query in the first place?

The data I hand over only has information from A, C, and D.

Why the hell are they handing over ALL the information whether it is encrypted or not? I only give you what you need and no more.
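
A minimal sketch of that "only hand over A, C, and D" idea, using Python's built-in sqlite3 module. The table name `records` and the sample values are made up for illustration; nothing here reflects how the actual taxi data was stored.

```python
import sqlite3

# Hypothetical table with columns A..E, where B and E are the sensitive ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (A TEXT, B TEXT, C TEXT, D TEXT, E TEXT)")
conn.execute(
    "INSERT INTO records VALUES ('a1', 'secret-b', 'c1', 'd1', 'secret-e')"
)

# Name the columns explicitly instead of SELECT *, so B and E never
# leave the database in any form, hashed or otherwise.
cursor = conn.execute("SELECT A, C, D FROM records")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```

This only works when the recipient has no need to group rows by the sensitive field. If they do, the field has to be replaced with something rather than dropped.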
 
2014-06-24 12:05:51 PM  

worlddan: serial_crusher:  But the cabbies? What benefit does a bad guy get out of that info?

That all depends on who one views as the bad guy. Imagine a jealous spouse of a cab driver who's been told the cabbie was out on his rounds picking up fares when he was in fact at his mistress's house. The premise implicit in your comment is that cab drivers have no privacy rights or interests to defend, which I think is nonsense regardless of what one thinks of the cab industry in general.


Sorry, I really didn't mean to put cab drivers down or disparage their profession, just hadn't considered that angle of attack.  You're right though.  A driver might have lied about being at work, just like a rider might have lied about where he was.  They both have similar risk.
 
2014-06-24 12:20:40 PM  

Stunt_Cat: How about NOT including those columns in the data query at all?

I have a table with columns A, B, C, D, and E. Let's say B and E are sensitive data. Why not exclude those columns from your query in the first place?

The data I hand over only has information from A, C, and D.

Why the hell are they handing over ALL the information whether it is encrypted or not? I only give you what you need and no more.


It's possible that the manner of record-keeping makes the sensitive information inseparable from the other information if you want to do anything useful with it.
 
2014-06-24 12:27:15 PM  
I don't understand at all why someone would use an MD5 hash of a sensitive field to anonymize a data set.

Why would one use the data that needs to be obscured as an input into generating the "anonymous" value?

It is trivially simple to generate a GUID for each unique value instead, and then there's no potential link between the real value and the anonymized value.

It doesn't guarantee that the non-anonymized fields can't be analyzed to infer the identity, but it completely eliminates any possibility that your algorithm can be decrypted or otherwise reverse engineered, and it is computationally very cheap and fast to do.

Maybe the IT guy at the taxi company just knew about MD5 hashes, but not GUIDs.  It's a shame when people attack a new problem using only the tools they already know about instead of doing a minimal amount of research to understand if there is a known good way to solve the problem.
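
A rough sketch of the GUID approach, assuming the trips are plain dictionaries with a `driver_id` field. The `pseudonymize` helper, the field names, and the sample IDs are all invented for illustration; the point is only that the released token is random and is never computed from the real value.

```python
import uuid


def pseudonymize(rows, key="driver_id"):
    """Replace each distinct value of `key` with a random UUID.

    The same real value always maps to the same token within one run,
    so per-driver queries still work on the released rows.
    """
    mapping = {}  # real value -> random token; keep private or destroy
    released = []
    for row in rows:
        real = row[key]
        if real not in mapping:
            # uuid4() is random, so the token carries no information
            # about the real value it stands in for.
            mapping[real] = str(uuid.uuid4())
        released.append({**row, key: mapping[real]})
    return released, mapping


# Made-up sample data.
trips = [
    {"driver_id": "D-1001", "start": "A", "end": "B"},
    {"driver_id": "D-2002", "start": "C", "end": "D"},
    {"driver_id": "D-1001", "start": "B", "end": "E"},
]

released, mapping = pseudonymize(trips)
for row in released:
    print(row)
```

The mapping table is the only link back to the real IDs, so it has to be protected or thrown away. As noted above, this does nothing about identities that can be inferred from the remaining fields.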
 
2014-06-24 12:57:05 PM  

adrift1827: Maybe the IT guy at the taxi company just knew about MD5 hashes, but not GUIDs. It's a shame when people attack a new problem using only the tools they already know about instead of doing a minimal amount of research to understand if there is a known good way to solve the problem.


All security bugs come down to laziness and/or disagreement about the severity of risk.  A single line of SQL gets what they got for this, i.e.:

SELECT MD5(DriverId), StartLocation, EndLocation, StartTime, EndTime FROM Trips

For the GUIDs to work I'd have to do extra steps to allocate a GUID for each driver, etc etc.

Somebody probably stuck it on a DBA's desk at 4:30 on a Friday.
 
2014-06-24 01:09:34 PM  

adrift1827: I don't understand at all why someone would use an MD5 hash of a sensitive field to anonymize a data set.

Why would one use the data that needs to be obscured as an input into generating the "anonymous" value?

It is trivially simple to generate a GUID for each unique value instead, and then there's no potential link between the real value and the anonymized value.

It doesn't guarantee that the non-anonymized fields can't be analyzed to infer the identity, but it completely eliminates any possibility that your algorithm can be decrypted or otherwise reverse engineered, and it is computationally very cheap and fast to do.

Maybe the IT guy at the taxi company just knew about MD5 hashes, but not GUIDs.  It's a shame when people attack a new problem using only the tools they already know about instead of doing a minimal amount of research to understand if there is a known good way to solve the problem.


You mean the Agile Developer at the software startup hired to create the metrics tracking for the cab company? I really don't think IT would have been the ones coding; most IT shops don't employ coders any more.

This has "security illiterate developer team" written all over it.
 
2014-06-24 01:25:42 PM  

wxboy: Stunt_Cat: How about NOT including those columns in the data query at all?

I have a table with columns A, B, C, D, and E. Let's say B and E are sensitive data. Why not exclude those columns from your query in the first place?

The data I hand over only has information from A, C, and D.

Why the hell are they handing over ALL the information whether it is encrypted or not? I only give you what you need and no more.

It's possible that the manner of record-keeping makes the sensitive information inseparable from the other information if you want to do anything useful with it.


The data is useless without at least the medallion number.  You have to be able to run queries per vehicle ID.

They did a straight hash, which was trivial to reverse because the entire universe of medallion numbers is known and the format is rigidly structured.
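
To see why a small, rigid ID space makes an unsalted hash reversible, here is a hedged sketch. The format used (digit, letter, digit, digit) is an assumption made up for the example, not the real medallion scheme, and `md5_hex` and `build_lookup` are hypothetical helper names.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Unsalted MD5 of an ASCII string, as a hex digest."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every possible ID in the assumed format and index by digest."""
    table = {}
    # Assumed format: digit, letter, digit, digit (e.g. "7B42").
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = d1 + letter + d2 + d3
        table[md5_hex(candidate)] = candidate
    return table


lookup = build_lookup()

# Pretend this digest appeared in the released data set.
leaked = md5_hex("7B42")
recovered = lookup[leaked]

print(len(lookup), "candidates hashed")
print(leaked, "->", recovered)
```

With only 10 x 26 x 10 x 10 = 26,000 candidates in this assumed format, the whole table builds almost instantly. A slower hash function would not help much here, because the input space is what's tiny.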
 
2014-06-24 01:39:19 PM  
The problem is that Bloomberg banned all the Salt!
 
2014-06-24 01:52:47 PM  

Generation_D: adrift1827: I don't understand at all why someone would use an MD5 hash of a sensitive field to anonymize a data set.

Why would one use the data that needs to be obscured as an input into generating the "anonymous" value?

It is trivially simple to generate a GUID for each unique value instead, and then there's no potential link between the real value and the anonymized value.

It doesn't guarantee that the non-anonymized fields can't be analyzed to infer the identity, but it completely eliminates any possibility that your algorithm can be decrypted or otherwise reverse engineered, and it is computationally very cheap and fast to do.

Maybe the IT guy at the taxi company just knew about MD5 hashes, but not GUIDs.  It's a shame when people attack a new problem using only the tools they already know about instead of doing a minimal amount of research to understand if there is a known good way to solve the problem.

You mean the Agile Developer at the software startup hired to create the metrics tracking for the cab company? I really don't think IT would have been the ones coding; most IT shops don't employ coders any more.

This has "security illiterate developer team" written all over it.


Per TFA the data came from the city government: "City officials released the data in response to a public records request"

Hiring a third party dev team to dump data you already have seems excessive to me, but also the kind of excessive that a government agency might get into.
 
2014-06-24 04:49:56 PM  

worlddan: serial_crusher:  But the cabbies? What benefit does a bad guy get out of that info?

That all depends on who one views as the bad guy. Imagine a jealous spouse of a cab driver who's been told the cabbie was out on his rounds picking up fares when he was in fact at his mistress's house. The premise implicit in your comment is that cab drivers have no privacy rights or interests to defend, which I think is nonsense regardless of what one thinks of the cab industry in general.


I can hunt down the driver and make him tell me who he saw with my wife that night.
 
Displayed 22 of 22 comments


This thread is archived, and closed to new comments.
