
Friday, March 30, 2012

please help me figure out how best to approach my queue problem

Hi,

I am moving my system from a custom created queue to a service broker queue.

I wish to have two queues :

one to hold processing messages
and one to hold emailing messages.

I do not understand why I would have to have an initiator queue and a target queue for each of the above. I have arranged my SQL as follows:

-- processing queue
CREATE MESSAGE TYPE ProcessingMessage VALIDATION = NONE;
CREATE CONTRACT ProcessingContract (ProcessingMessage SENT BY INITIATOR);
CREATE QUEUE ProcessingQueue;
CREATE SERVICE ProcessingService ON QUEUE ProcessingQueue (ProcessingContract);
GO

-- emailing queue
CREATE MESSAGE TYPE EmailingMessage VALIDATION = NONE;
CREATE CONTRACT EmailingContract (EmailingMessage SENT BY INITIATOR);
CREATE QUEUE EmailingQueue;
CREATE SERVICE EmailingService ON QUEUE EmailingQueue (EmailingContract);
GO

So basically EmailingQueue plays the role of the initiator and the target ( sends messages to itself )... and so does the ProcessingQueue.

I enqueue my messages with the following SP:

CREATE PROC [dbo].[queue_enqueue] (
@fromService SYSNAME,
@toService SYSNAME,
@onContract SYSNAME,
@messageType SYSNAME,
@entryId int)
AS
BEGIN
DECLARE @conversationHandle UNIQUEIDENTIFIER
DECLARE @error int

BEGIN DIALOG @conversationHandle
FROM SERVICE @fromService
TO SERVICE @toService
ON CONTRACT @onContract
WITH LIFETIME = 6000, ENCRYPTION = OFF;

SEND ON CONVERSATION @conversationHandle MESSAGE TYPE @messageType (@entryId)

END

I do the enqueueing with
[queue_enqueue] 'ProcessingService', N'ProcessingService', 'ProcessingContract', 'ProcessingMessage', 1

I dequeue my messages with the following SP:
CREATE PROC [dbo].[queue_dequeue]
@queue SYSNAME,
@entryId int OUTPUT
AS
BEGIN
DECLARE @conversationHandle UNIQUEIDENTIFIER;
DECLARE @messageTypeName SYSNAME;
DECLARE @conversationGroupId UNIQUEIDENTIFIER;
GET CONVERSATION GROUP @conversationGroupId FROM ProcessingQueue

if ( @conversationGroupId is not null )
BEGIN
if (@queue='ProcessingQueue')
RECEIVE TOP(1) @entryId = CONVERT(INT, [message_body]), @conversationHandle = [conversation_handle], @messageTypeName = [message_type_name] FROM ProcessingQueue where conversation_group_id = @conversationGroupId
else if (@queue='EmailingQueue')
RECEIVE TOP(1) @entryId = CONVERT(INT, [message_body]), @conversationHandle = [conversation_handle], @messageTypeName = [message_type_name] FROM EmailingQueue where conversation_group_id = @conversationGroupId
END
END

I dequeue by calling something like:
declare @entryId int
exec [queue_dequeue] 'ProcessingQueue', @entryId output

The above works, however I have a few issues that I am having trouble figuring out.

1. I don't want to "fire and forget". I want to close conversation at initiator endpoint only when target has closed conversation.
2. I want to know how to properly close conversations at initiator and target.
3. I am not polling with a while loop or WAITFOR because of the way my system (higher up) is set up; I simply want to enqueue one message and dequeue one message. If an error occurs in enqueueing or dequeueing I want to raise that error.
4. How do I handle poison messages?
5. Is it better for me to use two queues ( initiator and target ) for each of my queues above?

Can someone please help me with these issues?

thanx

Are you sure the code is correct? You are retrieving the conversation group from one queue, but potentially receive from another. Besides, GET CONVERSATION GROUP locks the conversation only for the duration of the transaction; if you don't wrap the call to [queue_dequeue] in a transaction, you can have errors if multiple threads are calling this procedure. Also, using GET CONVERSATION GROUP is recommended only for cases when the application needs to look up some application-specific state (correlated with the conversation_group_id) before actually RECEIVEing the messages. Otherwise one can simply go ahead and call the RECEIVE verb directly.

The Service Broker queues are not identical to the Computer Science 'queue' data structure (i.e. a FIFO list). They are more like message storage areas for services, and services are endpoints for distributed applications communicating through messages. As such, using Service Broker queues probably won't map exactly 1-to-1 with your previous user-table-queue code.

You already noticed that you now have to call END CONVERSATION. I'm not sure how the rest of the system works: are you reusing conversations when enqueueing? Is each enqueue a separate dialog? Probably [queue_dequeue] would have to call END CONVERSATION to end the conversation on which it received the current message. As long as you keep initiator and target on the same queue, this means that the same [queue_dequeue] procedure will have to handle the EndDialog messages (that would be one reason why you should split the queues).
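
For illustration, a minimal sketch of the split described above might look like the following; the initiator/target names and the direct RECEIVE inside a transaction are my assumptions, not code from the thread:

-- Separate initiator and target queues for the processing service (sketch only).
-- Assumes the ProcessingMessage type and ProcessingContract defined earlier.
CREATE QUEUE ProcessingInitiatorQueue;
CREATE QUEUE ProcessingTargetQueue;
CREATE SERVICE ProcessingInitiatorService ON QUEUE ProcessingInitiatorQueue;
CREATE SERVICE ProcessingTargetService ON QUEUE ProcessingTargetQueue (ProcessingContract);
GO

-- Dequeue with RECEIVE directly, inside a transaction; no GET CONVERSATION GROUP.
BEGIN TRANSACTION;

DECLARE @handle UNIQUEIDENTIFIER, @msgType SYSNAME, @entryId INT;

RECEIVE TOP (1)
    @handle  = conversation_handle,
    @msgType = message_type_name,
    @entryId = CONVERT(INT, message_body)
FROM ProcessingTargetQueue;

IF @msgType = N'ProcessingMessage'
BEGIN
    -- process @entryId here, then tell the initiator this side is done
    END CONVERSATION @handle;
END

COMMIT TRANSACTION;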

|||Hi Remus, thanks for your reply!

I admit, I used GET CONVERSATION GROUP when I got desperate... I thought that RECEIVE wasn't locking the queue entry on its own. I'll get rid of it since we don't really need to reuse conversations; we don't have a message-intensive system, we are mostly worried about message integrity and reduced processor load as compared to our previous system.

So, I gather it is better for me to set up my queues in the conventional Service Broker way: two queues for each physical queue that I have.

I still don't fully understand how the "don't fire and forget" method works:

1. Initiator starts a dialog and sends a message; does not end the dialog.
2. Target receives the message and ends the dialog, which in turn sends an EndDialog message to the initiator.
3. Initiator activation SP receives the EndDialog message and ends the dialog on the initiator side (see the sketch below)?
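
A rough sketch of steps 2 and 3, assuming split queues and an activation procedure; all of the names below are assumed, not taken from the thread:

-- Step 2, target side: after processing a message, END CONVERSATION @handle
-- sends the EndDialog message back to the initiator (see the earlier sketch).

-- Step 3, initiator side: an activation procedure drains EndDialog/Error
-- messages and closes the initiator half of each conversation.
CREATE PROCEDURE dbo.ProcessingInitiatorActivation
AS
BEGIN
    DECLARE @handle UNIQUEIDENTIFIER, @msgType SYSNAME;

    RECEIVE TOP (1)
        @handle  = conversation_handle,
        @msgType = message_type_name
    FROM ProcessingInitiatorQueue;

    IF @msgType IN (N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog',
                    N'http://schemas.microsoft.com/SQL/ServiceBroker/Error')
        END CONVERSATION @handle;
END
GO

ALTER QUEUE ProcessingInitiatorQueue
    WITH ACTIVATION (
        STATUS = ON,
        PROCEDURE_NAME = dbo.ProcessingInitiatorActivation,
        MAX_QUEUE_READERS = 1,
        EXECUTE AS OWNER);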

am I missing something?

thanx once again for your help.


Friday, March 23, 2012

Please guide - especially about Time Dimension and approach in general

Hi,

I have a table which contains all the transaction details for which I am trying to create a CUBE... The explanation below in brackets is only for clarity about each field. Kindly note that I am using the following table as my fact table. Let's call it tblFact

This table contains fields like transaction date, Product (which is sold), Product Family (to which the Product Belongs), Brand (of the product), Qty sold, Unit Price (at which each unit was sold), and some other fields.

I have created a Product dimension based on tblFact. I don't know if this is a good idea or not :confused: Is it okay and am I on the right track, or should I base my Product Dimension on some other table (say tblProducts) and then in the Cube editor link tblProducts with tblFact based on the ProductID field? Please guide.

Now coming to my last question:
Currently I am also creating my Time Dimension based on tblFact. Is this also a wrong approach?
1. Should I instead create the Time Dimension based on a different table (say tblTime) and again link up tblTime and tblFact in the Cube editor?

2. If yes, then how do I create tblTime from tblFact in a manner that it only contains all the transaction dates?

3. Assuming that I should take the tblTime approach, then should this table (tblTime) also contain the ProductID - representing the Product which was sold for each date in tblTime?

I realize that this is a lengthy post, but a reply and more importantly guidance from the experienced users will be greatly appreciated, because I have recently started learning/playing around on the OLAP side of things and I know that this is the time to get my foundations correct; otherwise I'll end up getting used to some bad practice and will find it difficult to change my approach to cube designing later down the road.

So many thanks in advance and I eagerly look forward to a reply from someone.

No worries mate,

This is what the forum is for...

Ok - Down to what you need to do

When doing the design for a cube I always do a bit of analysis first. Looks like you have cracked this bit. You know what your dimensions are - Time, Product, Brand etc. This is what you should group on to build your fact table.

Your facts are going to be Qty Sold and Price. This is what you will be summing on with the SQL to build your fact table.

You said "should I base my Product Dimension on some other table (say tblProducts and then in the Cube editor link tblProducts with tblFact based on the ProductID field? Please guide."

This is exactly what a good cube design is based on mate.

I presume the basis of your fact table is a transaction-type table.
First thing you need to do is build all your dimension tables.

A product dimension table should look something like:

create table tblProduct
(prod_id tinyint ,
prod_txt varchar (255)
)

Don't forget to put in an id for unknown product - just in case you get these in your base transaction table.
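
For example, the unknown member could simply be seeded up front (the id value here is arbitrary, just an illustration):

-- reserve a row for transactions whose product cannot be matched
insert into tblProduct (prod_id, prod_txt)
values (0, 'Unknown product')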

Build the rest of your dimension tables like this and assign a tinyint composite key to each dimension. What you want is for your fact table to be as small as possible in terms of datatypes.

Now once you have this you want to build your fact table.

What you do is take your base transaction table and join it to each of your dimension tables (should be a left outer join), sum on the qty and unit price, and group up across all your dimensions.

This now should be your fact table that you can reference in Analysis Manager. You will have to define all this within there as well.
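
As a sketch of that build step - assuming a base table tblTransactions and dimension tables tblProduct, tblBrand and tblTime; the table and column names here are guesses, not from the thread:

select  p.prod_id,
        b.brand_id,
        t.date_time,
        sum(tr.qty_sold)                 as qty_sold,
        sum(tr.qty_sold * tr.unit_price) as sales_amount
into    tblFact
from    tblTransactions tr
        left outer join tblProduct p on p.prod_txt  = tr.product
        left outer join tblBrand   b on b.brand_txt = tr.brand
        left outer join tblTime    t on t.date_time = tr.trans_date
group by p.prod_id, b.brand_id, t.date_time

Here the unit price is rolled up as a total sales amount; adjust the measures to whatever you actually need.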

Whoo, that was an effort.

Any problems or questions, give me a shout

Cheers|||Hi aldo_2003,

Many thanks for the reply. It answers a good number of my questions. Can you kindly advise regarding the remaining question i.e. the quoted portions below:

Build the rest of your dimension tables like this and assign a tinyint composite key to each dimension. What you want is for your fact table to be as small as possible in terms of datatypes.
Question: By composite key do you mean define a Primary key in each table? I will do so, but was planning to define the data type for my ProductID field (for example) in my tblProducts as int. However I'll follow your advice and instead use the datatype tinyint. Thanks for the tip :)

And my last question hopefully:
Now coming to my last question:
Currently I am also creating my Time Dimension based on tblFact. Is this also a wrong approach?
1. Should I instead create the Time Dimension based on a different table (say tblTime) and again link up tblTime and tblFact in the Cube editor?

2. if yes, then how do I create tblTime from tblFact in a manner that it only contains all the transaction dates.

3. Assuming that I should take the tblTime approach, then should this table (tblTime) also contain the ProductID - representing the Product which was sold for each date in tblTime?

I think Part 2 (above) is easy and all I have to do is make a copy of tblFact but this copy (say tblTime) will only contain the transaction date column (from tblFact). Kindly confirm my understanding.

However it's the answer to part 1 (above) and especially the part 3 above that is requested.

Looking forward to your reply.|||You're welcome,

forgot about the time question

what you want to do is create a tblTime dimension table

should basically have the granularity that you want to use

you can find scripts on the net that will help you create and manage a time dimension table.

table should look like

create table tblTime
(date_time smalldatetime,
quarter tinyint,
month tinyint,
week tinyint ,
day int
)

populate this table with all the dates in your date range i.e

Jan 1999 12:00am to Jan 2009 12:00am
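
A minimal population script for that range might be a simple loop like the following (a sketch only, assuming the tblTime layout above; week here is the calendar week number):

declare @d smalldatetime
set @d = '19990101'

while @d < '20090101'
begin
    -- one row per calendar day in the range
    insert into tblTime (date_time, quarter, month, week, day)
    values (@d, datepart(qq, @d), datepart(mm, @d), datepart(wk, @d), datepart(dd, @d))

    set @d = dateadd(dd, 1, @d)
end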

This is now your time dimension table.
Using Analysis Manager, join back on to the fact table.
Analysis Manager should guide you through the process of creating a time dimension.

Hope this helps

P.S. The only reason I used a tinyint is that I assumed you would have no more than 255 products; if you have more, then up the datatype to what you need

Cheers|||Hi again,

Don't mean to "over-flatter" but your replies REALLY have been of great help... Here I was trying to build everything (the fact as well as the dimensions) using only a single table and now I am quite clear regarding what's the right approach :)

3 last questions please :rolleyes:

you can find scripts on the net that will help you create and manage a time dimension table.
table should look like

create table tblTime
(date_time smalldatetime,
quarter tinyint,
month tinyint,
week tinyint ,
day int
)

That's another new tip :) Can you kindly guide where I can get these scripts from? I am assuming that these scripts will not only create the table (in a similar structure as you have suggested above) but will also populate the table with all the desired date ranges e.g. Jan 1999 12:00am to Jan 2009 12:00am.

2nd last question: So my approach which I was assuming for creating tblTime (for the Time dimension) was incorrect, i.e. I thought that this will simply contain ALL the transaction dates from the transaction table (as I described in my previous post). But from your reply my understanding is that this approach is wrong.

and the last question: So the tblTime does not have to store the ProductID?

Sorry for all the botheration.|||No worries buddy

Glad to be of help

The time dimension table is stand alone and does not have to contain any other info other than time info.

I'll see if my colleague knows and get back to you

Cheers|||http://www.databasejournal.com/features/mssql/article.php/10894_1466091_6

http://www.winnetmag.com/Article/ArticleID/23142/Windows_23142.html|||Originally posted by aldo_2003
The time dimension table is stand alone and does not have to contain any other info other than time info.

I'll see if my colleague knows and get back to you

Cheers

Thanks aldo : for the reply, the link to articles, and also for the clarification about the time dimension's underlying table containing only time related information.

Kindly do let me know if you get a script which not only creates the table for time but also populates it ...

Bless you!|||Check this link http://www.winnetmag.com/Articles/ArticleID/41531/pg/4/4.html for any help.|||Originally posted by Satya
Check this link http://www.winnetmag.com/Articles/ArticleID/41531/pg/4/4.html for any help.

Hi Satya,

Many sincere thanks for the article. I went through it...

However at this stage I am learning the basics (as evident from this post and my questions to aldo), therefore I found the article to be real "HEAVY STUFF" and very difficult to digest at the moment. It talks about the MDX world and that's something that I have yet to explore... I know I will have to get into MDX soon but have enough of the basics to get right first :)

I however would like to request if you can help me with another post from me in this forum (Subject: "Need help to create my Time Dimension")... it's somewhat related to a portion of the discussion in this post but I did not want to unnecessarily prolong this particular thread/post...

Looking forward to your help in my other post.

Thanks again and regards.|||Joozh
Since this has been answered so well I just wanted to chime in and suggest some reading for you

The Data Warehouse Toolkit by Ralph Kimball (http://www.bestwebbuys.com/books/compare/isbn/0471200247/isrc/b-home-search)

this is the second edition of this book, revised in 2002, and it is a must-have for every OLAP developer.

Please evaluate this approach to shrinking log files

I've been tasked with taking over the support for a client's SQL Server
database. I'm not a DBA by profession, I'm a software developer who
uses SQL Server as a database designer.

The clients have reported that the server is running out of disk space
and examination shows that the log files for several of the databases
are at 5Gb or more.

After reading around the subject I suggested the following sequence of
operations:

-- Select the name of the database you want to shrink
USE MyDB

-- Dump unwanted transactions
dump tran MyDB with truncate_only

-- Get the name of the logfile
SELECT * from sysfiles

-- Having examined the rows returned by this use the log file...

-- Shrink the file to required size (in MB)
DBCC SHRINKFILE('MyDB_log', 10)

Is this a reasonable approach? Please bear in mind that I'm pretty new
to this, and I have many other tasks to do besides manage the server.
A previous DBA has set up good maintenance plans etc. so everything is
being properly backed up (well, I think it is)

If this IS a good approach, would it be reasonable to do this on, say,
a monthly basis as a scheduled job? Obviously the step

SELECT * from sysfiles

which gives us the physical name of the log file would be removed and
the job would operate explicitly on each log file for each database in
turn.

Many thanks for reading.

William Balmer.

Quote:

Originally Posted by William:
A previous DBA has set up good maintenance plans etc. so everything is
being properly backed up (well, I think it is)


A common cause of unruly log files is that the database is in the FULL
recovery model but regular transaction log backups are not scheduled. Since
committed transactions won't get removed from the log until the log backup,
manual action is required to reduce the log size.

If you need to minimize data loss, the best approach is to schedule regular
log backups to run periodically between full database backups. You can
include this as part of a database maintenance plan. If more data loss is
acceptable (i.e. your plan is to simply restore from the last full backup),
you can use the SIMPLE recovery model so that committed data are
automatically removed from the log and you don't need to bother with log
backups.
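
As a sketch of those two options (the database and log file names are from the original post; the backup path is only an example, and the log backup would normally be scheduled as a job or maintenance plan step):

-- check which recovery model the database is actually using
SELECT DATABASEPROPERTYEX('MyDB', 'Recovery');

-- Option 1: stay in FULL recovery and back up the log regularly
BACKUP LOG MyDB TO DISK = 'D:\Backups\MyDB_log.trn';

-- Option 2: accept losing changes since the last full backup and let the
-- log truncate on its own
ALTER DATABASE MyDB SET RECOVERY SIMPLE;

-- in either case, shrink once after the log has been cleared, not on a schedule
DBCC SHRINKFILE('MyDB_log', 10);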

In any case, the log should be sized to accommodate the activity between log
backups (FULL recovery) or the largest transaction (SIMPLE). The high-water
mark of this log space is simply the cost of doing business. IMHO, log file
shrinking should be done only after the log grows due to unusual activity.
I wouldn't schedule log shrinking because automating the process will only
hide the underlying cause.

Keep in mind that the number one responsibility of a DBA (or one that wears
a DBA hat) is to implement and test a backup/recovery plan. If you haven't
already done so, I suggest you run a restore test. You don't want any
surprises when you need to do it for real.

--
Hope this helps.

Dan Guzman
SQL Server MVP

"William" <williambalmer@.hotmail.comwrote in message
news:1157713799.376752.153540@.i3g2000cwc.googlegro ups.com...

Quote:

Originally Posted by

I've been tasked with taking over the support for a client's SQL Server
database. I'm not a DBA by profession, I'm a software developer who
uses SQL Server as a database designer.
>
The clients have reported that the server is running out of disk space
and examination shows that the log files for several of the databases
are at 5Gb or more.
>
After reading around the subject I suggested the following sequence of
operations:
>
-- Select the name of the database you want to shrink
USE MyDB
>
-- Dump unwanted transactions
dump tran MyDB with truncate_only
>
-- Get the name of the logfile
SELECT * from sysfiles
>
-- Having examined the rows returned by this use the log file...
>
-- Shrink the file to required size (in MB)
DBCC SHRINKFILE('MyDB_log', 10)
>
Is this a reasonable approach? Please bear in mind that I'm pretty new
to this, and I have many other tasks to do besides manage the server.
A previous DBA has set up good maintenance plans etc. so everything is
being properly backed up (well, I think it is)
>
If this IS a good approach, would it be reasonable to do this on, say,
a monthly basis as a scheduled job? Obviously the step
>
SELECT * from sysfiles
>
which gives us the physical name of the log file would be removed and
the job would operate explicitly on each log file for each database in
turn.
>
Many thanks for reading.
>
William Balmer.
>

|||Dan Guzman wrote:

snip

Many thanks, Dan - much to think about. I'll read up on the various
topics and may post later if I come up against a brick wall!

Thanks again

William Balmer

Wednesday, March 21, 2012

Please advise on approach

Hi,
Apologies for the vague subject title but I couldn't think of a snappy
title to describe what I am doing. Please bear with me as I explain...
I am developing an application that records data from several
different sources operating at relatively high rates (e.g. >5 times
per sec). The data from the various sources is diverse, in terms of
the type (e.g. data from source A may comprise of 5 parameters in
binary format, whereas data from source B may comprise of a single
parameter in ascii format). The one thing they have in common is a
timestamp. The timestamp although unique for each source is not unique
amongst all sources. For example although there will be only ever be a
single row in TableA with a particular timestamp representing a
particular point in time, there may also be a row in TableB with the
same timestamp. In an attempt to have the writes as efficient as
possible and to prevent the tables growing too big I record data from
each source to different tables. These tables are created at runtime
before I begin to record the data. Following a period of recording
some tables may still have in excess of 500,000 rows. I now wish to
reconstruct the sequence of events in chronological order, i.e.
retrieve the earliest data recorded from all the sources, followed by
the next and so on. I'm wondering how to approach this. I was thinking
of creating a single large table into which I would insert the
timestamp and ID of every row from each source as well as the table
name, then creating a clustered index on the timestamp field, so that
the rows would be physically sorted in chronological order. This table
may look like the following when complete
1 09/22/2004 16:00:00 Table1
2 09/22/2004 16:00:01 Table1
1 09/22/2004 16:00:02 Table2
3 09/22/2004 16:00:03 Table1
2 09/22/2004 16:00:03 Table2
etc.
This would only ever have to happen once (I appreciate it may take
some time to construct) - then each time I need to I could simply move
through the table row by row retrieving the ID and table name which I
could then use as parameters in a query that would retrieve the data.
I'd appreciate any feedback on this approach i.e. is it madness! I was
wondering about views etc. but I don't know if these could help at all
Thanks,
Paul

First, timestamp has no relationship to date or time; the synonym is
rowversion. It is unique for a db and appears to currently be
implemented in such a way as to be predictable. Based on the current
implementation, one can make comparisons (involving greater than and less
than) that give correct results. However, I don't think the current
implementation can be safely relied on and, IIRC, MS only supports equal /
not equal comparisons. Note - since you are posting in a db-related NG, I
assume that your reference to timestamp is actually a reference to the sql
server datatype. If this assumption is incorrect - well that's your fault
for using an ambiguous term outside of the implied context.
Second, datetime values are accurate to 3 milliseconds. If your term
"timestamp" refers to a datatype of this nature, is this accuracy
sufficient? Note that there are two aspects to accuracy. First is the
relative difference in accuracy between your datasource timestamp values and
this datatype. The second aspect is relative to a difference in accuracy.
Your timestamp data might be more accurate (e.g., 1 ms) but occurs at
intervals that are significantly less accurate (e.g., > 3 ms). Of course,
you could use a different datatype, losing the ability to use the builtin
datetime functions and creating potential ordering problems.
Third, a table is, by definition, unordered. Your narrative assumes and
implies otherwise. Clustering does affect physical ordering. However,
there is nothing that can guarantee you can access rows in physical
insertion order unless that order can be determined by the data itself. In
other words, you must provide a way to include the appropriate information
in an order by clause. Even with a heap and a cursor, I'm not certain that
there is any way to guarantee the desired order.
Given this information, how do you intend to store the data (regardless of
number of tables) in such a way that one can reconstruct the events in
chrono. order? There are two implications in your narrative that might be
problematic. First, is that "reconstruction" implies a single thread of
execution. Is this valid? If not, then you might also need to record
additional information to differentiate the multiple streams of data (which
just might be adding data at the same instant in time). The other potential
pitfall is that reconstruction often implies "re-running" (e.g.,
reconstruction of the data in the same order/timeframe). Fear this!
So now we're down to identifying chronological order. Assume that source A
has data with timestamp 10:22:01 (ignoring the date portion for simplicity)
and source B has data with the same timestamp (you said this was possible).
They both get inserted into the DB (regardless of table) at roughly the same
time. Based on data alone, it is not possible to determine which is first.
What does the DB have to assist? You could use a timestamp for the table
(or tables) and rely on the current implementation to derive order. Somone
recently posted that this type of solution was working. You could use a
datetime column to mark each row with the datetime of insert. This approach
may suffer due to accuracy limitations. Lastly, you could use a single
table with an identity. This should logically work (and it is what you
considered) but may be problematic due to contention / locking. Hotspots
will definitely be something to avoid. Note that this approach is also
predicated on the immediate insertion of data, implying that every data
"event" is recorded on a one-by-one basis. This is not the best approach in
terms of network usage (and does not scale well).
You might want to investigate any potential client-side approaches as well.
In a single-threaded application, you could also impose your own serial
numbering on the data before insertion.
Lastly, you might want to indicate what your ultimate goal is. Relational
databases are often slower than file-based approaches for data of this type
(there is a lot of overhead that you may not need for recording data).
There are other ways to record "data" in fifo order that might later be
imported into a real database for analysis or reporting. If there are no
real relationships in your data, then perhaps you don't need a rdbms.
Without knowing what you intend to do with the information, this is the
approach I would investigate first.
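
To make the index-table idea concrete, here is a sketch along the lines discussed above; the names are assumed, the identity column only breaks ties between rows carrying the same source datetime, and the ORDER BY is what actually guarantees the replay order:

create table tblEventIndex
(event_seq   int identity(1,1) not null,
 event_time  datetime          not null,
 source_id   int               not null,  -- id of the row in the source table
 source_tbl  sysname           not null   -- e.g. 'Table1', 'Table2'
)

create clustered index IX_EventIndex_Time
    on tblEventIndex (event_time, event_seq)

-- replay in chronological order
select event_time, source_id, source_tbl
from tblEventIndex
order by event_time, event_seq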
"Paul" <paulsmith5@.hotmail.com> wrote in message
news:ca236fb1.0409220850.2058abc9@.posting.google.c om...
> Hi,
> Apologies for the vague subject title but I couldn't think of a snappy
> title to describe what I am doing. Please bear with me as I explain...
> I am developing an application that records data from several
> different sources operating at relatively high rates (e.g. >5 times
> per sec). The data from the various sources is diverse, in terms of
> the type (e.g. data from source A may comprise of 5 parameters in
> binary format, whereas data from source B may comprise of a single
> parameter in ascii format). The one thing they have in common is a
> timestamp. The timestamp although unique for each source is not unique
> amongst all sources. For example although there will be only ever be a
> single row in TableA with a particular timestamp representing a
> particular point in time, there may also be a row in TableB with the
> same timestamp. In an attempt to have the writes as efficient as
> possible and to prevent the tables growing too big I record data from
> each source to different tables. These tables are created at runtime
> before I begin to record the data. Following a period of recording
> some tables may still have in excess of 500,000 rows. I now wish to
> reconstruct the sequence of events in chronological order, i.e.
> retrieve the earliest data recorded from all the sources, followed by
> the next and so on. I'm wondering how to approach this. I was thinking
> of creating a single large table into which I would insert the
> timestamp and ID of every row from each source as well as the table
> name, then ceating a clustered index on the timestamp field, so that
> the rows would be physically sorted in chronlogical order. This table
> may look like the following when complete
> 1 09/22/2004 16:00:00 Table1
> 2 09/22/2004 16:00:01 Table1
> 1 09/22/2004 16:00:02 Table2
> 3 09/22/2004 16:00:03 Table1
> 2 09/22/2004 16:00:03 Table2
> etc.
> This would only ever have to happen once (I appreciate it may take
> some time to construct) - then each time I need to I could simply move
> through the table row by row retrieving the ID and table name which I
> could then use as parameters in a query that would retrieve the data.
> I'd appreciate any feedback on this approach i.e. is it madness! I was
> wondering about views etc. but I don't know if these could help at all
> Thanks,
> Paul
|||Hi Scott,
Thanks for the reply. Apologies for any confusion arising from my use
of the term timestamp - it does not refer to the sql data type in this
case, rather to the datetime at which the source produced the data -
which is provided by the source itself. Therefore the limitations in
terms of sql server accuracy you refer to do not apply in this case.
Limitations in terms of accuracy emanate from the sources themselves
and I don't wish to concern myself with them at this point as much as
to say that action has been taken to ensure they are synchronised as
much as possible. I did take the precaution (whether right or wrong)
of recording however the millisecond portion of the 'timestamp' to a
separate field to maintain its integrity. I note your point on the
clustered index/ordering thanks. I think that the order in which the
data was produced by the source can be determined from the data itself
- notably the timestamp. The physical insertion order does not concern
me but perhaps if needed I could use an identity field - but if a
single thread is involved as is the case this is a moot point (I
think). Regarding identifying chronological order you are correct when
you ask where can the database assist in differentiating between two
records with the same timestamp/datetimes but I'm thinking it doesn't
have to. I'm only interested in it differentiating between records
with different datetimes i.e. a record with a datetime preceding
another one will be higher up in the order. When I replay the data in
sequence and I encounter two records with the same datetime the delay
in displaying them will be down to how quickly they can be processed
and when you are talking about portions of a second the result will
often be invisible to the naked eye - not perfect but not critical
either. As far as using a non-relational system that's out of the
question at this point.
Again thanks for the reply, I found it useful and appreciate your
time. Perhaps I will refrain from asking such general questions in
future - I think they can raise more questions than they answer - a
problem hard to address through this medium.
Paul
|||It's very difficult to answer general questions in a NG. Context usually
has direct bearing on the issues and the context is difficult to communicate
(as well as lengthy). From the short discussion, it looks like you need to
conduct a thorough analysis of the data (a good bit you've already done).
The characteristics of the data will lead (or force!) you to the appropriate
design. Just be aware of the datatype issues (some of which I mentioned)
and their potential effects on your system. One last comment - you might
want to conduct some testing involving a typically loaded machine / network
to verify that it can handle the expected volume of inserts. Better to know
this ahead of time (where you can design around it) than discover it
afterwards.
Good luck.
"Paul" <paulsmith5@.hotmail.com> wrote in message
news:ca236fb1.0409230256.6a4876b0@.posting.google.c om...[vbcol=seagreen]
> Hi Scott,
> Thanks for the reply. Apologies for any confusion arising from my use
> of the term timestamp - it does not refer to the sql data type in this
> case, rather to the datetime at which the source produced the data -
> which is provided by the source itself. Therefore the limitations in
> terms of sql server accuracy you refer to do not apply in this case.
> Limitations in terms of accuracy emanate from the sources themselves
> and I don't wish to concern myself with them at this point as much as
> to say that action has been taken to ensure the are synchronised as
> much as possible. I did take the precaution (whether right or wrong)
> of recording however the millisecond portion of the 'timestamp' to a
> separate field to maintain its integrity. I note your point on the
> clustered index/ordering thanks. I think that the order in which the
> data was produced by the source can be determined from the data itself
> - notably the timestamp. The physical insertion order does not concern
> me but perhaps if needed I could use an identity field - but if a
> single thread is involved as is the case this is a mute point (I
> think). Regarding identifying chronological order you are correct when
> you ask where can the database assist in differentiating between two
> records with the same timestamp/datetimes but I'm thinking it doesn't
> have to. I'm only interested in it differentiating between records
> with different datetimes i.e. a record with a datetime preceding
> another one will be higher up in the order. When I replay the data in
> sequence and I encounter two records with the same datetime the delay
> in displaying them will be down to how quickly they can be processed
> and when you are talking about portions of a second the result will
> often be invisible to the naked eye - not perfect but not critical
> either. As far as using a non-relational system that's out of the
> question at this point.
> Again thanks for the reply, I found it useful and appreciate your
> time. Perhaps I will refrain from asking such general questions in
> future - I think they can raise more questions then they answer - a
> problem hard to address through this medium.
> Paul
>
> Assume that source A

Please advise on approach

Hi,
Apologies for the vague subject title but I couldn't think of a snappy
title to describe what I am doing. Please bear with me as I explain...
I am developing an application that records data from several
different sources operating at relatively high rates (e.g. >5 times
per sec). The data from the various sources is diverse, in terms of
the type (e.g. data from source A may comprise of 5 parameters in
binary format, whereas data from source B may comprise of a single
parameter in ascii format). The one thing they have in common is a
timestamp. The timestamp although unique for each source is not unique
amongst all sources. For example although there will be only ever be a
single row in TableA with a particular timestamp representing a
particular point in time, there may also be a row in TableB with the
same timestamp. In an attempt to have the writes as efficient as
possible and to prevent the tables growing too big I record data from
each source to different tables. These tables are created at runtime
before I begin to record the data. Following a period of recording
some tables may still have in excess of 500,000 rows. I now wish to
reconstruct the sequence of events in chronological order, i.e.
retrieve the earliest data recorded from all the sources, followed by
the next and so on. I'm wondering how to approach this. I was thinking
of creating a single large table into which I would insert the
timestamp and ID of every row from each source as well as the table
name, then ceating a clustered index on the timestamp field, so that
the rows would be physically sorted in chronlogical order. This table
may look like the following when complete
1 09/22/2004 16:00:00 Table1
2 09/22/2004 16:00:01 Table1
1 09/22/2004 16:00:02 Table2
3 09/22/2004 16:00:03 Table1
2 09/22/2004 16:00:03 Table2
etc.
This would only ever have to happen once (I appreciate it may take
some time to construct) - then each time I need to I could simply move
through the table row by row retrieving the ID and table name which I
could then use as parameters in a query that would retrieve the data.
I'd appreciate any feedback on this approach i.e. is it madness! I was
wondering about views etc. but I don't know if these could help at all
Thanks,
PaulFirst, timestamp has no relationship to date or time; the synonym is
rowversion. It is unique for a db and appears to be currently be
implemented in such a way as to be predictable. Based on the current
implementation, one can make comparisons (involving greater than and less
than) that give correct results. However, I don't think the current
implementation can be safely relied on and, IIRC, MS only supports equal /
not equal comparisons. Note - since you are posting in a db-related NG, I
assume that your reference to timestamp is actually a reference to the sql
server datatype. If this assumption is incorrect - well that's your fault
for using an ambiguous term outside of the implied context.
Second, datetime values are accurate to 3 milliseconds. If your term
"timestamp" refers to a datatype of this nature, is this accuracy
sufficient? Note that there are two aspects to accuracy. First is the
relative difference in accuracy between your datasource timestamp values and
this datatype. The second aspect is relative to a difference in accuracy.
Your timestamp data might be more accurate (e.g., 1 ms) but occurs at
intervals that are significantly less accurate (e.g., > 3 ms). Of course,
you could use a different datatype, losing the ability to use the builtin
datetime functions and creating potential ordering problems.
Third, a table is, be definition, unordered. Your narrative assumes and
implies otherwise. Clustering does affect physical ordering. However,
there is nothing that can guarantee you can access rows in physical
insertion order unless that order can be determined by the data itself. In
other words, you must provide a way to include the appropriate information
in an order by clause. Even with a heap and a cursor, I'm not certain that
there is any way to guarantee the desired order.
Given this information, how do you intend to store the data (regardless of
number of tables) in such a way that one can reconstruct the events in
chrono. order? There are two implications in your narrative that might be
problematic. First, is that "reconstruction" implies a single thread of
execution. Is this valid? If not, then you might also need to record
additional information to differentiate the multiple streams of data (which
just might be adding data at the same instant in time). The other potential
pitfall is that reconstruction often implies "re-running" (e.g.,
reconstruction of the data in the same order/timeframe). Fear this!
So now we're down to identifying chronological order. Assume that source A
has data with timestamp 10:22:01 (ignoring the date portion for simplicity)
and source B has data with the same timestamp (you said this was possible).
They both get inserted into the DB (regardless of table) at roughly the same
time. Based on data alone, it is not possible to determine which is first.
What does the DB have to assist? You could use a timestamp for the table
(or tables) and rely on the current implementation to derive order. Somone
recently posted that this type of solution was working. You could use a
datetime column to mark each row with the datetime of insert. This approach
may suffer due to accuracy limitations. Lastly, you could use a single
table with an identity. This should logically work (and it is what you
considered) but may be problematic due to contention / locking. Hotspots
will definitely be something to avoid. Note that this approach is also
predicated on the immediate insertion of data, implying that every data
"event" is recorded on a one-by-one basis. This is not the best approach in
terms of network usage (and does not scale well).
You might want to investigate any potential client-side approaches as well.
In a single-threaded application, you could also impose your own serial
numbering on the data before insertion.
Lastly, you might want to indicate what your ultimate goal is. Relational
databases are often slower than file-based approaches for data of this type
(there is a lot of overhead that you may not need for recording data).
There are other ways to record "data" in fifo order that might later to
imported into a real database for analysis or reporting. If there are no
real relationships in your data, then perhaps you don't need a rdbms.
Without knowing what you intend to do with the information, this is the
approach I would investigate first.
"Paul" <paulsmith5@.hotmail.com> wrote in message
news:ca236fb1.0409220850.2058abc9@.posting.google.com...
> Hi,
> Apologies for the vague subject title but I couldn't think of a snappy
> title to describe what I am doing. Please bear with me as I explain...
> I am developing an application that records data from several
> different sources operating at relatively high rates (e.g. >5 times
> per sec). The data from the various sources is diverse, in terms of
> the type (e.g. data from source A may comprise of 5 parameters in
> binary format, whereas data from source B may comprise of a single
> parameter in ascii format). The one thing they have in common is a
> timestamp. The timestamp although unique for each source is not unique
> amongst all sources. For example although there will be only ever be a
> single row in TableA with a particular timestamp representing a
> particular point in time, there may also be a row in TableB with the
> same timestamp. In an attempt to have the writes as efficient as
> possible and to prevent the tables growing too big I record data from
> each source to different tables. These tables are created at runtime
> before I begin to record the data. Following a period of recording
> some tables may still have in excess of 500,000 rows. I now wish to
> reconstruct the sequence of events in chronological order, i.e.
> retrieve the earliest data recorded from all the sources, followed by
> the next and so on. I'm wondering how to approach this. I was thinking
> of creating a single large table into which I would insert the
> timestamp and ID of every row from each source as well as the table
> name, then ceating a clustered index on the timestamp field, so that
> the rows would be physically sorted in chronlogical order. This table
> may look like the following when complete
> 1 09/22/2004 16:00:00 Table1
> 2 09/22/2004 16:00:01 Table1
> 1 09/22/2004 16:00:02 Table2
> 3 09/22/2004 16:00:03 Table1
> 2 09/22/2004 16:00:03 Table2
> etc.
> This would only ever have to happen once (I appreciate it may take
> some time to construct) - then each time I need to I could simply move
> through the table row by row retrieving the ID and table name which I
> could then use as parameters in a query that would retrieve the data.
> I'd appreciate any feedback on this approach i.e. is it madness! I was
> wondering about views etc. but I don't know if these could help at all
> Thanks,
> Paul|||Hi Scott,
Hi Scott,

Thanks for the reply. Apologies for any confusion arising from my use of the term timestamp - it does not refer to the SQL data type in this case, but rather to the datetime at which the source produced the data, which is provided by the source itself. Therefore the limitations in terms of SQL Server accuracy you refer to do not apply in this case. Limitations in terms of accuracy emanate from the sources themselves, and I don't wish to concern myself with them at this point beyond saying that action has been taken to ensure they are synchronised as much as possible. I did take the precaution (whether right or wrong) of recording the millisecond portion of the 'timestamp' to a separate field to maintain its integrity.

I note your point on the clustered index/ordering, thanks. I think that the order in which the data was produced by the source can be determined from the data itself - notably the timestamp. The physical insertion order does not concern me, but if needed I could perhaps use an identity field - although if a single thread is involved, as is the case here, this is a moot point (I think).

Regarding identifying chronological order, you are correct when you ask how the database can assist in differentiating between two records with the same timestamp/datetime, but I'm thinking it doesn't have to. I'm only interested in it differentiating between records with different datetimes, i.e. a record with a datetime preceding another's will be higher up in the order. When I replay the data in sequence and encounter two records with the same datetime, the delay in displaying them will come down to how quickly they can be processed, and when you are talking about fractions of a second the result will often be invisible to the naked eye - not perfect, but not critical either. As far as using a non-relational system goes, that's out of the question at this point.

Again, thanks for the reply; I found it useful and appreciate your time. Perhaps I will refrain from asking such general questions in future - I think they can raise more questions than they answer, a problem that is hard to address through this medium.

Paul
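If the millisecond portion does sit in its own column, the replay query can simply order on both fields; a sketch, assuming the hypothetical EventIndex table above gains an int EventMs column:

SELECT SourceRowId, EventTime, EventMs, SourceTable
FROM EventIndex
ORDER BY EventTime, EventMs   -- seconds-level datetime first, then the separately stored milliseconds
GO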
"Scott Morris" <bogus@.bogus.com> wrote in message news:<uNMwmBNoEHA.3324@.TK2MSFTNGP12.phx.gbl>...
> First, timestamp has no relationship to date or time; the synonym is
> rowversion. It is unique for a db and appears currently to be
> implemented in such a way as to be predictable. Based on the current
> implementation, one can make comparisons (involving greater than and less
> than) that give correct results. However, I don't think the current
> implementation can be safely relied on and, IIRC, MS only supports equal /
> not equal comparisons. Note - since you are posting in a db-related NG, I
> assume that your reference to timestamp is actually a reference to the sql
> server datatype. If this assumption is incorrect - well that's your fault
> for using an ambiguous term outside of the implied context.
> Second, datetime values are accurate to 3 milliseconds. If your term
> "timestamp" refers to a datatype of this nature, is this accuracy
> sufficient? Note that there are two aspects to accuracy. The first is the
> relative difference in accuracy between your datasource timestamp values and
> this datatype. The second is the interval at which values occur relative to
> that accuracy: your timestamp data might be more precise (e.g., 1 ms) but
> occur at intervals that vary by significantly more (e.g., > 3 ms). Of course,
> you could use a different datatype, losing the ability to use the builtin
> datetime functions and creating potential ordering problems.
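That granularity is easy to see directly: datetime values are stored in increments of .000, .003 and .007 seconds, for example:

SELECT CAST('2004-09-22 16:00:00.001' AS datetime)   -- stored as 16:00:00.000
SELECT CAST('2004-09-22 16:00:00.005' AS datetime)   -- stored as 16:00:00.007
GO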
> Third, a table is, by definition, unordered. Your narrative assumes and
> implies otherwise. Clustering does affect physical ordering. However,
> there is nothing that can guarantee you can access rows in physical
> insertion order unless that order can be determined by the data itself. In
> other words, you must provide a way to include the appropriate information
> in an order by clause. Even with a heap and a cursor, I'm not certain that
> there is any way to guarantee the desired order.
> Given this information, how do you intend to store the data (regardless of
> number of tables) in such a way that one can reconstruct the events in
> chrono. order? There are two implications in your narrative that might be
> problematic. The first is that "reconstruction" implies a single thread of
> execution. Is this valid? If not, then you might also need to record
> additional information to differentiate the multiple streams of data (which
> just might be adding data at the same instant in time). The other potential
> pitfall is that reconstruction often implies "re-running" (e.g.,
> reconstruction of the data in the same order/timeframe). Fear this!
> So now we're down to identifying chronological order. Assume that source A
> has data with timestamp 10:22:01 (ignoring the date portion for simplicity)
> and source B has data with the same timestamp (you said this was possible).
> They both get inserted into the DB (regardless of table) at roughly the same
> time. Based on data alone, it is not possible to determine which is first.
> What does the DB have to assist? You could use a timestamp for the table
> (or tables) and rely on the current implementation to derive order. Someone
> recently posted that this type of solution was working. You could use a
> datetime column to mark each row with the datetime of insert. This approach
> may suffer due to accuracy limitations. Lastly, you could use a single
> table with an identity. This should logically work (and it is what you
> considered) but may be problematic due to contention / locking. Hotspots
> will definitely be something to avoid. Note that this approach is also
> predicated on the immediate insertion of data, implying that every data
> "event" is recorded on a one-by-one basis. This is not the best approach in
> terms of network usage (and does not scale well).
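A minimal sketch combining the last two suggestions - an identity column as a tie-breaker together with a column defaulting to the insertion datetime (table and column names are illustrative only):

CREATE TABLE EventLog (
    EventSeq   int IDENTITY(1,1) NOT NULL,            -- assigned at insert time; breaks ties
    SourceTime datetime NOT NULL,                      -- timestamp supplied by the source
    InsertTime datetime NOT NULL DEFAULT (GETDATE()),  -- when the row reached the server
    SourceName sysname NOT NULL,
    Payload    varbinary(256) NULL
)
GO

-- replay: source timestamp first, identity as the tie-breaker for equal timestamps
SELECT EventSeq, SourceTime, SourceName
FROM EventLog
ORDER BY SourceTime, EventSeq
GO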
> You might want to investigate any potential client-side approaches as well.
> In a single-threaded application, you could also impose your own serial
> numbering on the data before insertion.
> Lastly, you might want to indicate what your ultimate goal is. Relational
> databases are often slower than file-based approaches for data of this type
> (there is a lot of overhead that you may not need for recording data).
> There are other ways to record "data" in FIFO order that might later be
> imported into a real database for analysis or reporting. If there are no
> real relationships in your data, then perhaps you don't need a rdbms.
> Without knowing what you intend to do with the information, this is the
> approach I would investigate first.
> "Paul" <paulsmith5@.hotmail.com> wrote in message
> news:ca236fb1.0409220850.2058abc9@.posting.google.com...
> > Hi,
> >
> > Apologies for the vague subject title but I couldn't think of a snappy
> > title to describe what I am doing. Please bear with me as I explain...
> >
> > I am developing an application that records data from several
> > different sources operating at relatively high rates (e.g. >5 times
> > per sec). The data from the various sources is diverse, in terms of
> > the type (e.g. data from source A may comprise of 5 parameters in
> > binary format, whereas data from source B may comprise of a single
> > parameter in ascii format). The one thing they have in common is a
> > timestamp. The timestamp although unique for each source is not unique
> > amongst all sources. For example although there will be only ever be a
> > single row in TableA with a particular timestamp representing a
> > particular point in time, there may also be a row in TableB with the
> > same timestamp. In an attempt to have the writes as efficient as
> > possible and to prevent the tables growing too big I record data from
> > each source to different tables. These tables are created at runtime
> > before I begin to record the data. Following a period of recording
> > some tables may still have in excess of 500,000 rows. I now wish to
> > reconstruct the sequence of events in chronological order, i.e.
> > retrieve the earliest data recorded from all the sources, followed by
> > the next and so on. I'm wondering how to approach this. I was thinking
> > of creating a single large table into which I would insert the
> > timestamp and ID of every row from each source as well as the table
> > name, then ceating a clustered index on the timestamp field, so that
> > the rows would be physically sorted in chronlogical order. This table
> > may look like the following when complete
> >
> > 1 09/22/2004 16:00:00 Table1
> > 2 09/22/2004 16:00:01 Table1
> > 1 09/22/2004 16:00:02 Table2
> > 3 09/22/2004 16:00:03 Table1
> > 2 09/22/2004 16:00:03 Table2
> > etc.
> >
> > This would only ever have to happen once (I appreciate it may take
> > some time to construct) - then each time I need to I could simply move
> > through the table row by row retrieving the ID and table name which I
> > could then use as parameters in a query that would retrieve the data.
> > I'd appreciate any feedback on this approach i.e. is it madness! I was
> > wondering about views etc. but I don't know if these could help at all
> >
> > Thanks,
> >

It's very difficult to answer general questions in a NG. Context usually
has direct bearing on the issues and the context is difficult to communicate
(as well as lengthy). From the short discussion, it looks like you need to
conduct a thorough analysis of the data (a good bit you've already done).
The characteristics of the data will lead (or force!) you to the appropriate
design. Just be aware of the datatype issues (some of which I mentioned)
and their potential effects on your system. One last comment - you might
want to conduct some testing involving a typically loaded machine / network
to verify that it can handle the expected volume of inserts. Better to know
this ahead of time (where you can design around it) than discover it
afterwards.
Good luck.
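As a very rough starting point for that kind of volume test - a server-side sketch only, reusing the hypothetical EventLog table sketched earlier (a realistic test should drive the inserts from the real client across the real network):

DECLARE @start datetime, @i int, @elapsed_ms int
SET @start = GETDATE()
SET @i = 0
WHILE @i < 10000
BEGIN
    -- insert a dummy row; replace the values with something representative of the real payload
    INSERT INTO EventLog (SourceTime, SourceName, Payload)
    VALUES (GETDATE(), 'Table1', 0x00)
    SET @i = @i + 1
END
SET @elapsed_ms = DATEDIFF(ms, @start, GETDATE())
SELECT @elapsed_ms AS elapsed_ms,
       10000.0 * 1000 / NULLIF(@elapsed_ms, 0) AS rows_per_sec
GO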
"Paul" <paulsmith5@.hotmail.com> wrote in message
news:ca236fb1.0409230256.6a4876b0@.posting.google.com...
> Hi Scott,
> Thanks for the reply. Apologies for any confusion arising from my use
> of the term timestamp - it does not refer to the sql data type in this
> case, rather to the datetime at which the source produced the data -
> which is provided by the source itself. Therefore the limitations in
> terms of sql server accuracy you refer to do not apply in this case.
> Limitations in terms of accuracy emanate from the sources themselves
> and I don't wish to concern myself with them at this point as much as
> to say that action has been taken to ensure the are synchronised as
> much as possible. I did take the precaution (whether right or wrong)
> of recording however the millisecond portion of the 'timestamp' to a
> separate field to maintain its integrity. I note your point on the
> clustered index/ordering thanks. I think that the order in which the
> data was produced by the source can be determined from the data itself
> - notably the timestamp. The physical insertion order does not concern
> me but perhaps if needed I could use an identity field - but if a
> single thread is involved as is the case this is a mute point (I
> think). Regarding identifying chronological order you are correct when
> you ask where can the database assist in differentiating between two
> records with the same timestamp/datetimes but I'm thinking it doesn't
> have to. I'm only interested in it differentiating between records
> with different datetimes i.e. a record with a datetime preceding
> another one will be higher up in the order. When I replay the data in
> sequence and I encounter two records with the same datetime the delay
> in displaying them will be down to how quickly they can be processed
> and when you are talking about portions of a second the result will
> often be invisible to the naked eye - not perfect but not critical
> either. As far as using a non-relational system that's out of the
> question at this point.
> Again thanks for the reply, I found it useful and appreciate your
> time. Perhaps I will refrain from asking such general questions in
> future - I think they can raise more questions then they answer - a
> problem hard to address through this medium.
> Paul
>
> Assume that source A
> > has data with timestamp 10:22:01 (ignoring the date portion for
simplicity)
> > and source B has data with the same timestamp (you said this was
possible).
> > They both get inserted into the DB (regardless of table) at roughly the
same
> > time. Based on data alone, it is not possible to determine which is
first.
> > What does the DB have to assist?
> "Scott Morris" <bogus@.bogus.com> wrote in message
news:<uNMwmBNoEHA.3324@.TK2MSFTNGP12.phx.gbl>...
> > First, timestamp has no relationship to date or time; the synonym is
> > rowversion. It is unique for a db and appears to be currently be
> > implemented in such a way as to be predictable. Based on the current
> > implementation, one can make comparisons (involving greater than and
less
> > than) that give correct results. However, I don't think the current
> > implementation can be safely relied on and, IIRC, MS only supports equal
/
> > not equal comparisons. Note - since you are posting in a db-related NG,
I
> > assume that your reference to timestamp is actually a reference to the
sql
> > server datatype. If this assumption is incorrect - well that's your
fault
> > for using an ambiguous term outside of the implied context.
> >
> > Second, datetime values are accurate to 3 milliseconds. If your term
> > "timestamp" refers to a datatype of this nature, is this accuracy
> > sufficient? Note that there are two aspects to accuracy. First is the
> > relative difference in accuracy between your datasource timestamp values
and
> > this datatype. The second aspect is relative to a difference in
accuracy.
> > Your timestamp data might be more accurate (e.g., 1 ms) but occurs at
> > intervals that are significantly less accurate (e.g., > 3 ms). Of
course,
> > you could use a different datatype, losing the ability to use the
builtin
> > datetime functions and creating potential ordering problems.
> >
> > Third, a table is, be definition, unordered. Your narrative assumes and
> > implies otherwise. Clustering does affect physical ordering. However,
> > there is nothing that can guarantee you can access rows in physical
> > insertion order unless that order can be determined by the data itself.
In
> > other words, you must provide a way to include the appropriate
information
> > in an order by clause. Even with a heap and a cursor, I'm not certain
that
> > there is any way to guarantee the desired order.
> >
> > Given this information, how do you intend to store the data (regardless
of
> > number of tables) in such a way that one can reconstruct the events in
> > chrono. order? There are two implications in your narrative that might
be
> > problematic. First, is that "reconstruction" implies a single thread of
> > execution. Is this valid? If not, then you might also need to record
> > additional information to differentiate the multiple streams of data
(which
> > just might be adding data at the same instant in time). The other
potential
> > pitfall is that reconstruction often implies "re-running" (e.g.,
> > reconstruction of the data in the same order/timeframe). Fear this!
> >
> > So now we're down to identifying chronological order. Assume that
source A
> > has data with timestamp 10:22:01 (ignoring the date portion for
simplicity)
> > and source B has data with the same timestamp (you said this was
possible).
> > They both get inserted into the DB (regardless of table) at roughly the
same
> > time. Based on data alone, it is not possible to determine which is
first.
> > What does the DB have to assist? You could use a timestamp for the
table
> > (or tables) and rely on the current implementation to derive order.
Somone
> > recently posted that this type of solution was working. You could use a
> > datetime column to mark each row with the datetime of insert. This
approach
> > may suffer due to accuracy limitations. Lastly, you could use a single
> > table with an identity. This should logically work (and it is what you
> > considered) but may be problematic due to contention / locking.
Hotspots
> > will definitely be something to avoid. Note that this approach is also
> > predicated on the immediate insertion of data, implying that every data
> > "event" is recorded on a one-by-one basis. This is not the best
approach in
> > terms of network usage (and does not scale well).
> >
> > You might want to investigate any potential client-side approaches as
well.
> > In a single-threaded application, you could also impose your own serial
> > numbering on the data before insertion.
> >
> > Lastly, you might want to indicate what your ultimate goal is.
Relational
> > databases are often slower than file-based approaches for data of this
type
> > (there is a lot of overhead that you may not need for recording data).
> > There are other ways to record "data" in fifo order that might later to
> > imported into a real database for analysis or reporting. If there are
no
> > real relationships in your data, then perhaps you don't need a rdbms.
> > Without knowing what you intend to do with the information, this is the
> > approach I would investigate first.
> >
> > "Paul" <paulsmith5@.hotmail.com> wrote in message
> > news:ca236fb1.0409220850.2058abc9@.posting.google.com...
> > > Hi,
> > >
> > > Apologies for the vague subject title but I couldn't think of a snappy
> > > title to describe what I am doing. Please bear with me as I explain...
> > >
> > > I am developing an application that records data from several
> > > different sources operating at relatively high rates (e.g. >5 times
> > > per sec). The data from the various sources is diverse, in terms of
> > > the type (e.g. data from source A may comprise of 5 parameters in
> > > binary format, whereas data from source B may comprise of a single
> > > parameter in ascii format). The one thing they have in common is a
> > > timestamp. The timestamp although unique for each source is not unique
> > > amongst all sources. For example although there will be only ever be a
> > > single row in TableA with a particular timestamp representing a
> > > particular point in time, there may also be a row in TableB with the
> > > same timestamp. In an attempt to have the writes as efficient as
> > > possible and to prevent the tables growing too big I record data from
> > > each source to different tables. These tables are created at runtime
> > > before I begin to record the data. Following a period of recording
> > > some tables may still have in excess of 500,000 rows. I now wish to
> > > reconstruct the sequence of events in chronological order, i.e.
> > > retrieve the earliest data recorded from all the sources, followed by
> > > the next and so on. I'm wondering how to approach this. I was thinking
> > > of creating a single large table into which I would insert the
> > > timestamp and ID of every row from each source as well as the table
> > > name, then ceating a clustered index on the timestamp field, so that
> > > the rows would be physically sorted in chronlogical order. This table
> > > may look like the following when complete
> > >
> > > 1 09/22/2004 16:00:00 Table1
> > > 2 09/22/2004 16:00:01 Table1
> > > 1 09/22/2004 16:00:02 Table2
> > > 3 09/22/2004 16:00:03 Table1
> > > 2 09/22/2004 16:00:03 Table2
> > > etc.
> > >
> > > This would only ever have to happen once (I appreciate it may take
> > > some time to construct) - then each time I need to I could simply move
> > > through the table row by row retrieving the ID and table name which I
> > > could then use as parameters in a query that would retrieve the data.
> > > I'd appreciate any feedback on this approach i.e. is it madness! I was
> > > wondering about views etc. but I don't know if these could help at all
> > >
> > > Thanks,
> > >
> > > Paul