SQL Server


If you manage, write, visit, or otherwise have anything to do with a web app that connects to a SQL Server database, good guy and Microsoft Program Manager Buck Woody wants you to read this:

[copied with permission from here]

You might have read recently that there have been ongoing SQL injection attacks against vulnerable web applications occurring over the last few months.  These attacks have received recurring attention in the press as they pop up in various geographies around the world. These attacks do not leverage any SQL Server vulnerabilities or any un-patched vulnerabilities in any Microsoft product – the attack vector is vulnerable custom applications. In fact, SQL Injection is a coding issue that can attack any database system, so it’s a good idea to learn how to defend against them.

In order to help you respond to and defend yourself from these attacks, Microsoft has an authoritative blog including talking points and guidance.  You can find this at this Technet location. (Retype the underlying URL if you like. I only linked it this way because it wrapped.)

Ok, if you didn’t visit the Technet link, visit it before reading on.

Thanks. Now I’ll add another bit of advice:

There’s a non-SQL injection issue here as well. The risk in question starts when a web application incorporates part of the URL into SQL and executes it blindly (SQL injection), but the risk to end users only occurs because the web app commits “HTML injection.” The web app unwittingly delivers a malicious bit of HTML that says “Hey browser, please run a script from this other web site.” That malicious bit of HTML won’t be sent to my browser if the web application doesn’t blindly incorporate table data (especially table data containing HTML tags) into the HTML pages it delivers.
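To make the first half of that concrete, here is a minimal T-SQL sketch of the difference between pasting a value into SQL text and passing it as a parameter. The temp table #Articles and the hostile search string are made up for illustration; sp_executesql is the standard system procedure for running parameterized SQL.

```sql
-- Hypothetical table and input, for illustration only.
create table #Articles (
  ArticleID int primary key,
  Title nvarchar(100) not null
);
insert into #Articles values (1, N'Ranking in SQL Server');
insert into #Articles values (2, N'Spatial types');

declare @search nvarchar(100);
set @search = N'x'' or 1=1 --';  -- a value lifted straight from a URL

-- Unsafe: the value becomes part of the statement text, so the
-- WHERE clause turns into:  Title = N'x' or 1=1 --'
declare @unsafe nvarchar(400);
set @unsafe =
  N'select ArticleID, Title from #Articles where Title = N'''
  + @search + N'''';
exec (@unsafe);  -- every row comes back

-- Safer: the value is passed as a parameter and is only ever
-- compared as data, never parsed as SQL.
exec sp_executesql
  N'select ArticleID, Title from #Articles where Title = @t',
  N'@t nvarchar(100)',
  @t = @search;  -- no row has that title, so nothing comes back

drop table #Articles;
```

The parameterized form closes the door on this particular trick; it does nothing about data that is already bad, which is the other half of the problem.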

Here’s an analogy. When you fill a prescription, you get instructions like “Take one pill twice a day for seven days.” Those instructions probably get printed out of some database. If the instructions say “Chew up all the pills and wash them down with a cup of bleach,” something’s wrong with the pharmacy’s database. Something’s also wrong with the pharmacy for not catching the bogus instructions before dispensing the prescription. And if you follow the instructions, something’s wrong with you.

The risk Buck is drawing our attention to is like this, and the Technet blog tells us to secure our database. Just as importantly, we should pay attention to what we dispense, and not just assume that if we’re dispensing our data, it’s good data. Browsers often render (and in the case of scripts, execute) whatever a trusted site sends them, and if trusted sites send HTML out without vetting it, well, they shouldn’t be trusted. If you’re a web developer and you want your site to be trusted, then vet what you deliver.

I don’t do web apps, but I don’t think a responsible web app should send me script tags that refer to third-party sites. In fact, the web app probably shouldn’t send me any table data without scrubbing it for tags, non-printing ASCII characters, etc.
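From the database side, one crude check is easy to sketch: look for table data that already contains script tags. The table #Comments and its rows below are hypothetical, and a LIKE pattern is only a smoke test, not a substitute for encoding or scrubbing what the application sends out.

```sql
-- Hypothetical table of user-visible text, for illustration only.
create table #Comments (
  CommentID int primary key,
  Body nvarchar(400) not null
);
insert into #Comments values (1, N'Nice post, thanks.');
insert into #Comments values
  (2, N'Great!<script src="http://example.invalid/x.js"></script>');

-- Flag rows that would deliver a script tag if sent out unmodified.
-- Whether the match is case-sensitive depends on the column's collation.
select CommentID, Body
from #Comments
where Body like N'%<script%';

drop table #Comments;
```

A query like this can tell you that something has already gone wrong; it won’t catch every disguise a tag can wear.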

Many years ago, we thought it was funny to email people BEL characters, and then someone figured out email shouldn’t be allowed to contain BEL. Years ago bulletin boards figured out they shouldn’t allow users to put any old HTML into their posts. The threat then was still minor - jokers figured out they could mess up some bulletin board formatting by posting opening tags without closing them. Apparently this was only half fixed. Web apps typically scrub what comes in through the expected channels, but a lot of web apps (most?) apparently don’t scrub the HTML they send out. They should. In fact, they must, now that the bad guys have figured out how to exploit sloppy web apps to modify table data bypassing the expected route. The bad guys may soon find some more sloppy code and exploit it to mess with your data.

Just as it’s possible to scrub outgoing email for viruses, it should be possible (and routine) to scrub outgoing HTML for malicious content. While I don’t trust email attachments that have a “no viruses” sticker on them, and I wouldn’t trust a random site that tells me “this web page is safe,” I would trust Microsoft or another trustworthy source if they told me their web servers scrub all outgoing web pages for unexpected script tags.

Before SQL Server 2005 was released, a calculation requiring a ranking was both relatively difficult to express as a single query and relatively inefficient to execute. That changed in SQL Server 2005 with support for the SQL analytic functions RANK(), ROW_NUMBER(), etc., and partial support for SQL’s OVER clause.

Spearman’s rho (Spearman’s correlation coefficient) is a useful statistic that can be calculated more easily in SQL Server 2005 than in earlier versions. Below is an implementation of Spearman’s rho for SQL Server 2005 and later.

SQL’s RANK() and the rank order required for the calculation of Spearman’s rho are slightly different: if for example four values are tied for third place, RANK() will equal 3 for all four of them. The Spearman’s formula requires them all to be ranked 4.5, the average of their positions (3rd, 4th, 5th, and 6th) in an ordered list of the data. To address this difference, the code below adjusts the SQL RANK() by adding to it 0.5 for each occurrence of a data value beyond the first. I used COUNT(*) with an OVER clause for this.
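The four-way tie described above can be checked directly. This is a small sketch using a made-up table variable @t; the adjusted-rank expression is the same one used in the procedure further down.

```sql
-- Made-up values: 30 occurs four times, in positions 3 through 6.
declare @t table (v int not null);
insert into @t values (10);
insert into @t values (20);
insert into @t values (30);
insert into @t values (30);
insert into @t values (30);
insert into @t values (30);
insert into @t values (40);

select
  v,
  rank() over (order by v) as sql_rank,  -- 3 for every 30
  rank() over (order by v)
    + (count(*) over (partition by v) - 1)/2.0 as spearman_rank  -- 4.5 for every 30
from @t
order by v;
```

Untied values are unchanged, because the COUNT(*) term is zero for them.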

The script below demonstrates the calculation for two data sets. The first one is from Wikipedia’s page on Spearman’s rho; I made up the second data set to include duplicate data values. I haven’t tested the code thoroughly, but for a variety of small test data sets, it matches hand calculations and the result here [1].
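For reference, the quantity the procedure computes is the usual sum-of-squared-differences form of Spearman’s rho, where d_i is the difference between the two adjusted ranks of row i and n is the number of rows. With tied values this form is an approximation to the correlation of the ranks rather than an exact identity.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)},
\qquad d_i = \mathrm{rk}_x(i) - \mathrm{rk}_y(i)
```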

create table SampleData (
ID int identity(1,1) primary key,
x decimal(5,2),
y decimal(5,2)
);

insert into SampleData(x,y) values(106,7);
insert into SampleData(x,y) values(86,0);
insert into SampleData(x,y) values(100,27);
insert into SampleData(x,y) values(101,50);
insert into SampleData(x,y) values(99,28);
insert into SampleData(x,y) values(103,29);
insert into SampleData(x,y) values(97,20);
insert into SampleData(x,y) values(113,12);
insert into SampleData(x,y) values(112,6);
insert into SampleData(x,y) values(110,17);
go

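create procedure Spearman as
-- Rank x and y separately; tied values get the average of their positions.
with RankedSampleData(ID,x,y,rk_x,rk_y) as (
  select
    ID,
    x,
    y,
    -- rank() gives ties the lowest position; add 0.5 per extra occurrence
    rank() over (order by x) +
      (count(*) over (partition by x) - 1)/2.0,
    rank() over (order by y) +
      (count(*) over (partition by y) - 1)/2.0
  from SampleData
)
-- rho = 1 - 6*sum(d^2)/(n*(n^2 - 1)), where d = rk_x - rk_y
select
  1e0 -
  (
    6
    *sum(square(rk_x-rk_y))
    /count(*)
    /(square(count(*)) - 1)
  )
from RankedSampleData;
go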

exec Spearman;

go
truncate table SampleData;
go

insert into SampleData(x,y) values(1,3);
insert into SampleData(x,y) values(3,5);
insert into SampleData(x,y) values(5,8);
insert into SampleData(x,y) values(3,4);
insert into SampleData(x,y) values(4,7);
insert into SampleData(x,y) values(4,6);
insert into SampleData(x,y) values(3,4);
go

exec Spearman;
go

drop proc Spearman;
drop table SampleData;

[1] Wessa, P. (2008), Free Statistics Software, Office for Research Development and Education, version 1.1.22-r4, URL http://www.wessa.net/

Finding elapsed time in SQL Server is easy, so long as the clock is always running: just use DATEDIFF. But you often need to find elapsed time excluding certain periods, like weekends, nights, or holidays. A fellow SQL Server MVP recently posed a variation on this problem: to find the number of minutes between two times, where the clock is running only from 6:00am-6:00pm, Monday-Friday. He needed this to compute how long trouble tickets stayed at a help desk that was open for those hours.

I came up with a function DeskTimeDiff_minutes(@from,@to) for him. It requires a permanent table that spans the range of times you might care about, holding one row for every time the clock is turned on or off, weekdays at 6:00am and 6:00pm in this case.

The table also holds an “absolute business time” in minutes (ABT-m): the total number of “help desk open” minutes since a fixed but arbitrary “beginning of time.” Elapsed help desk time is then simply the difference between ABT-m values. While the table only records the ABT-m 10 times a week, you can find the ABT-m for an arbitrary datetime @d easily. Find the row of the table with time d closest to @d but not later. In that row you’ll find the ABT-m at time d, and you’ll also find out whether the clock was (or will be) running or not between d and @d. If not, the ABT-m at time @d is the same as at time d. Otherwise, add the number of minutes between d and @d.
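As a worked instance of that rule, take a ticket opened on Wednesday 2007-12-19 at 6:00pm, just as the desk closes, and resolved the following Monday at 5:51pm. The clock is stopped until Thursday morning, so the elapsed desk time can be cross-checked by hand without the table:

```sql
-- Thursday and Friday contribute 12 desk hours each, the weekend
-- contributes nothing, and Monday contributes 06:00 through 17:51.
select
  2 * 12 * 60
  + datediff(minute, '2007-12-24T06:00:00', '2007-12-24T17:51:00')
  as expected_minutes;  -- 1440 + 711 = 2151
```

The function should return the same figure for those two times, since it is just the difference of the two ABT-m values.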

Here’s the code. The reference table here is good from early 2000 until well past 2050, and you can easily extend it or adapt it to other business rules. A larger permanent table of times shouldn’t affect performance, because the function only performs (two) index seek lookups on the table.

If you cut and paste this for your own use, watch out for “smart quotes” or other Wordpress/Live Writer formatting quirks.

create table Minute_Count(
  d datetime primary key,
  elapsed_minutes int not null,
  timer varchar(10) not null check (timer in ('Running','Stopped'))
);

insert into Minute_Count values ('2000-01-03T06:00:00',0,'Running');
insert into Minute_Count values ('2000-01-03T18:00:00',12*60,'Stopped');

insert into Minute_Count values ('2000-01-04T06:00:00',12*60,'Running');
insert into Minute_Count values ('2000-01-04T18:00:00',24*60,'Stopped');

insert into Minute_Count values ('2000-01-05T06:00:00',24*60,'Running');
insert into Minute_Count values ('2000-01-05T18:00:00',36*60,'Stopped');

insert into Minute_Count values ('2000-01-06T06:00:00',36*60,'Running');
insert into Minute_Count values ('2000-01-06T18:00:00',48*60,'Stopped');

insert into Minute_Count values ('2000-01-07T06:00:00',48*60,'Running');
insert into Minute_Count values ('2000-01-07T18:00:00',60*60,'Stopped');
/* any Monday-Friday week */

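/* Copy the first week forward by doubling: each pass shifts every
   existing row by @week weeks (60 desk hours per week), so the table
   covers weeks 0 through 4095 when the loop ends. */
declare @week int;
set @week = 1;
while @week < 2100 begin
  insert into Minute_Count
    select
      dateadd(week,@week,d),
      elapsed_minutes + 60*@week*60,
      timer
    from Minute_Count;
  set @week = @week * 2;
end;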

go

create function DeskTimeDiff_minutes(
  @from datetime,
  @to datetime
) returns int as begin
  declare @fromSerial int;
  declare @toSerial int;
  with S(d,elapsed_minutes,timer) as (
    select top 1 d,elapsed_minutes, timer
    from Minute_Count
    where d <= @from
    order by d desc
  )
    select @fromSerial =
      elapsed_minutes +
      case when timer = 'Running'
      then datediff(minute,d,@from)
      else 0 end
    from S;
  with S(d,elapsed_minutes,timer) as (
    select top 1 d,elapsed_minutes, timer
    from Minute_Count
    where d <= @to
    order by d desc
  )
    select @toSerial =
      elapsed_minutes +
      case when timer = 'Running'
      then datediff(minute,d,@to)
      else 0 end
    from S;
  return @toSerial - @fromSerial;
end;
go
select max(d) from Minute_Count;
select dbo.DeskTimeDiff_minutes('2007-12-19T18:00:00','2007-12-24T17:51:00');
go

drop function DeskTimeDiff_minutes;
drop table Minute_Count;

Microsoft plans to support spatial data types in SQL Server 2008, and a preview is available to the community in the latest CTP (community technology preview), available here.

John O’Brien, a Windows Live Developer MVP, has been trying out the new spatial types in some cool Virtual Earth projects (John’s site is here), and in one of his projects, SQL Server threw an interesting error message. When he zoomed far enough out in Virtual Earth, then tried to create a polygon from the map bounds, SQL Server reacted with:

“The specified input does not represent a valid geography instance because it exceeds a single hemisphere. Each geography instance must fit inside a single hemisphere. A common reason for this error is that a polygon has the wrong ring orientation.”

John found a workaround, dividing the map into two pieces, but he was interested to know what the SQL Server folk thought about the situation. Here’s my reply. It’s less a response to John’s inquiry than it is a ramble about geometry and what hemispheres and orientation have to do with how you can or can’t specify polygons.

To begin, think of the earth’s Equator as a polygon. How would you answer the following questions?

  • “If I travel Eastbound around the earth along the equator, have I gone clockwise or counter-clockwise?”
  • “Is the north pole inside the equator or outside the equator?”

In the plane (or on a flat map of the world), a polygon or other closed non-self-intersecting curve has a well-defined “inside” and “outside”. A polygon separates the plane into two regions, one that has finite area and one that is unbounded. The finite region is deemed “inside” the polygon. On a sphere, however, a closed curve determines two finite regions, either of which might be what someone thinks of as the inside.

For example, the four-sided outline of the US state of Wyoming separates the earth into what you could call “Wyoming” and “anti-Wyoming.” But are we so sure which is the inside and which is the outside? Our intuition is that the smaller region is always the inside, but there’s nothing about geometry and geography to tell us that. Maybe Wyoming is most of the world. A single geographic region could contain most of the earth’s surface within its borders, couldn’t it?

Suppose Wyoming declared itself to be Great Wyoming, annexed all of North America and Europe, and continued to conquer the world. Suppose its armies crossed the equator and eventually took over almost everything: everything but Antarctica, in fact.

The boundary of Great Wyoming would then be the same as the boundary of Antarctica. You would probably want Great Wyoming to be inside the boundary of Great Wyoming and Antarctica to be inside the boundary of Antarctica, but how can that work when the boundaries are the same?

This is a problem. On a sphere, the naïve idea of interior/exterior isn’t well-defined. One solution would be to pass a law that every polygon on earth must fit inside a single hemisphere with room to spare. We could then define the interior of a polygon to be the smaller of the two regions it determines. This would place Antarctica, not Wyoming, within the borders of Great Wyoming—wrong, but unambiguous. And anyway, who would ever need to consider a region bigger than 640K that doesn’t fit inside a single hemisphere?

Fortunately, though, we don’t have to abandon or compromise the notion of interior and exterior on the earth’s surface: Antarctica can remain outside Great Wyoming. All we need to do is be precise about the direction in which we describe a polygon. When specifying the boundary of a region, you can give a forwards/backwards or clockwise/counter-clockwise sense to the boundary by choosing the way you order the list of vertices. List them so that what you consider inside the region is on your left as you “connect the dots,” because we will adopt the convention that the left side as you walk the perimeter is the inside. What’s on the right will be interpreted as outside. Now you can describe the boundary of Great Wyoming. Just describe it as drawn from west to east, so Antarctica is on the right (exterior). (This works because a sphere is an “orientable surface.” SQL Server’s new geography data type isn’t supported on a Klein bottle, where CultureInfo.IsOrientableWorld, if such a property existed, would be false.)

Once we require polygons to be oriented, there’s no need to require that they fit within a single hemisphere, but nonetheless, SQL Server 2008’s geography data type adopts the hemisphere requirement. For geography objects of type Polygon, I think this is a good idea. I’m not sure whether it’s a standard GIS requirement or just SQL Server’s, but it prevents users from accidentally entering the coordinates of Wyoming in clockwise fashion only to discover later that Perth and Addis Ababa, but not Cheyenne, are in Wyoming. [For some of the other geography types, such as LineString, I don’t see a benefit from requiring the object to fit in a hemisphere, but consistency isn’t a bad thing.]
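Here is a rough sketch of what orientation looks like in practice. The corner coordinates are approximate and purely illustrative, and I’m assuming well-known text written longitude first with SRID 4326; preview builds may order the coordinates differently.

```sql
-- Counter-clockwise: heading east along the southern edge first,
-- so the state's interior is on the left.
declare @ccw geography;
set @ccw = geography::STGeomFromText(
  'POLYGON((-111 41, -104 41, -104 45, -111 45, -111 41))', 4326);
select @ccw.STArea() as wyoming_square_meters;

-- The same ring listed clockwise: the left-hand side is now
-- everything except Wyoming.
begin try
  declare @cw geography;
  set @cw = geography::STGeomFromText(
    'POLYGON((-111 41, -111 45, -104 45, -104 41, -111 41))', 4326);
  select @cw.STArea() as anti_wyoming_square_meters;
end try
begin catch
  select error_message() as hemisphere_error;
end catch;
```

On a build that enforces the hemisphere rule, I’d expect the second polygon to produce the error message quoted earlier. A version without that restriction would instead accept it and report an area covering nearly the whole globe.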

Groundbreaking when it was published in 1955, the classic book “A Million Random Digits with 100,000 Normal Deviates” has been republished electronically by the RAND Corporation with permission “to duplicate this electronic document for personal use only, as long as it is unaltered and complete.” Books like these were a staple of statistical research in the mid-20th century, and this particular one was highly revered.

Nowadays, there are better sources of random numbers, such as HotBits, and there are many ways to generate pseudorandom numbers, which are not random, but have many of the properties of random numbers and are useful for many purposes.

I hope it’s not a violation of the copyright for me to provide instructions on how to use SQL to load the book’s content in its published format (or any identically-formatted list) into a SQL table that can be queried for random (not pseudorandom) sequences of numbers. The script uses a few of SQL Server 2005’s new features, including the BULK rowset provider for text files, some of the new analytic functions, and TOP with a variable. You’ll also need a table-valued function called Numbers(), like the one in my previous SQL post.

The RAND book is available here, and my script works for the support file “Datafile: A Million Random Digits,” available for download here. The SQL Server 2005 script below assumes you’ve downloaded this file and unzipped it to C:\RAND\MillionDigits.txt.

The beginning of the file looks like this

00000   10097 32533  76520 13586  34673 54876  80959 09117  39292 74945
00001   37542 04805  64894 74296  24805 24037  20636 10402  00822 91665
00002   08422 68953  19645 09303  23209 02560  15953 34764  35080 33606
00003   99019 02529  09376 70715  38311 31165  88676 74397  04436 27659
00004   12807 99970  80157 36147  64032 36653  98951 16877  12171 76833
00005   66065 74717  34072 76850  36697 36170  65813 39885  11199 29170
00006   31060 10805  45571 82406  35303 42614  86799 07439  23403 09732
00007   85269 77602  02051 65692  68665 74818  73053 85247  18623 88579
00008   63573 32135  05325 47048  90553 57548  28468 28709  83491 25624
00009   73796 45753  03529 64778  35808 34282  60935 20344  35273 88435

Unix-style newlines (0x0A) are used, and the million digits are organized into 200,000 five-digit integers with leading zeroes (20,000 numbered lines of ten each), so the script will import the file into a table of 200,000 five-digit numbers (as char(5) data with leading zeroes). Here’s the script:   (more…)
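Separately from the full script, here is a minimal sketch of how a single line in this format can be split into its ten char(5) values. It is not the import script itself; the variable names and the position column are my own illustration, and it assumes each line is a five-digit line number followed by fifty digits separated only by spaces.

```sql
declare @line varchar(100);
set @line = '00000   10097 32533  76520 13586  34673 54876  80959 09117  39292 74945';

-- Strip the spaces: what remains is the 5-digit line number
-- followed by 50 digits.
declare @digits varchar(100);
set @digits = replace(@line, ' ', '');

with K(k) as (
  select 0 union all select 1 union all select 2 union all select 3
  union all select 4 union all select 5 union all select 6
  union all select 7 union all select 8 union all select 9
)
select
  cast(left(@digits, 5) as int) * 10 + k as position,  -- hypothetical ordering key
  substring(@digits, 6 + 5*k, 5) as five_digits        -- kept as text, leading zeroes intact
from K
order by k;
```

For the first line of the file this returns 10097, 32533, 76520 and so on, in file order.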

One of the things that kept me busy this past winter and spring was tech editing Itzik Ben-Gan’s two books in Microsoft Press’s Inside Microsoft® SQL Server™ 2005 series (1,2). Of Itzik’s many clever solutions to programming problems, my favorite was this function that returns a table of consecutive integers. It’s blazingly fast, and it’s the best way I know of to generate a sequence on the fly - probably even better than accessing a permanent table of integers.

create function Numbers(
  @from as bigint,
  @to as bigint
) returns table with schemabinding as return
  with t0(n) as (
    select 1 union all select 1
  ), t1(n) as (
    select 1 from t0 as a, t0 as b
  ), t2(n) as (
    select 1 from t1 as a, t1 as b
  ), t3(n) as (
    select 1 from t2 as a, t2 as b
  ), t4(n) as (
    select 1 from t3 as a, t3 as b
  ), t5(n) as (
    select 1 from t4 as a, t4 as b
  ), Numbers(n) as (
    select row_number() over (order by n) as n
    from t5
  )
    select @from + n - 1 as n
    from Numbers
    where n <= @to - @from + 1
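A short usage sketch, assuming the function above has been created in the dbo schema. The start date in the second query is arbitrary.

```sql
-- Six rows: 5, 6, 7, 8, 9, 10
select n
from dbo.Numbers(5, 10)
order by n;

-- One row per day for a week. n is bigint, so it is cast to int
-- before being handed to dateadd.
select dateadd(day, cast(n as int), cast('20080101' as datetime)) as d
from dbo.Numbers(0, 6)
order by n;
```

Because the function is an inline table-valued function, it can be joined and filtered like any other table expression.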

Estimated row size in bytes is an important factor used by the SQL Server optimizer to estimate query cost, and I’ve found an anomaly in the estimated costing algorithm for the Sort operator, as well as in the actual cost of sorting long data. The estimated cost of a Sort seems to take a giant jump when the estimated row size exceeds 4000 bytes, but that jump in estimated cost doesn’t correspond to any jump in actual cost.

It’s important to note that the jump does not depend on the length of the sort key, but only on the length of the row data being carried along. The cost estimate for sorting an estimated-to-be-long row on a short key is much greater than for sorting an estimated-to-be-medium-length row on the same short key.   (more…)