Hi,
Apologies for the vague subject title but I couldn't think of a snappy
title to describe what I am doing. Please bear with me as I explain...
I am developing an application that records data from several
different sources operating at relatively high rates (e.g. >5 times
per sec). The data from the various sources is diverse, in terms of
the type (e.g. data from source A may comprise of 5 parameters in
binary format, whereas data from source B may comprise of a single
parameter in ascii format). The one thing they have in common is a
timestamp. The timestamp although unique for each source is not unique
amongst all sources. For example although there will be only ever be a
single row in TableA with a particular timestamp representing a
particular point in time, there may also be a row in TableB with the
same timestamp. In an attempt to have the writes as efficient as
possible and to prevent the tables growing too big I record data from
each source to different tables. These tables are created at runtime
before I begin to record the data. Following a period of recording
some tables may still have in excess of 500,000 rows. I now wish to
reconstruct the sequence of events in chronological order, i.e.
retrieve the earliest data recorded from all the sources, followed by
the next and so on. I'm wondering how to approach this. I was thinking
of creating a single large table into which I would insert the
timestamp and ID of every row from each source as well as the table
name, then ceating a clustered index on the timestamp field, so that
the rows would be physically sorted in chronlogical order. This table
may look like the following when complete
1 09/22/2004 16:00:00 Table1
2 09/22/2004 16:00:01 Table1
1 09/22/2004 16:00:02 Table2
3 09/22/2004 16:00:03 Table1
2 09/22/2004 16:00:03 Table2
etc.
This would only ever have to happen once (I appreciate it may take
some time to construct) - then each time I need to I could simply move
through the table row by row retrieving the ID and table name which I
could then use as parameters in a query that would retrieve the data.
I'd appreciate any feedback on this approach i.e. is it madness! I was
wondering about views etc. but I don't know if these could help at all
Thanks,
PaulFirst, timestamp has no relationship to date or time; the synonym is
rowversion. It is unique for a db and appears to be currently be
implemented in such a way as to be predictable. Based on the current
implementation, one can make comparisons (involving greater than and less
than) that give correct results. However, I don't think the current
implementation can be safely relied on and, IIRC, MS only supports equal /
not equal comparisons. Note - since you are posting in a db-related NG, I
assume that your reference to timestamp is actually a reference to the sql
server datatype. If this assumption is incorrect - well that's your fault
for using an ambiguous term outside of the implied context.
Second, datetime values are accurate to 3 milliseconds. If your term
"timestamp" refers to a datatype of this nature, is this accuracy
sufficient? Note that there are two aspects to accuracy. First is the
relative difference in accuracy between your datasource timestamp values and
this datatype. The second aspect is relative to a difference in accuracy.
Your timestamp data might be more accurate (e.g., 1 ms) but occurs at
intervals that are significantly less accurate (e.g., > 3 ms). Of course,
you could use a different datatype, losing the ability to use the builtin
datetime functions and creating potential ordering problems.
Third, a table is, be definition, unordered. Your narrative assumes and
implies otherwise. Clustering does affect physical ordering. However,
there is nothing that can guarantee you can access rows in physical
insertion order unless that order can be determined by the data itself. In
other words, you must provide a way to include the appropriate information
in an order by clause. Even with a heap and a cursor, I'm not certain that
there is any way to guarantee the desired order.
Given this information, how do you intend to store the data (regardless of
number of tables) in such a way that one can reconstruct the events in
chrono. order? There are two implications in your narrative that might be
problematic. First, is that "reconstruction" implies a single thread of
execution. Is this valid? If not, then you might also need to record
additional information to differentiate the multiple streams of data (which
just might be adding data at the same instant in time). The other potential
pitfall is that reconstruction often implies "re-running" (e.g.,
reconstruction of the data in the same order/timeframe). Fear this!
So now we're down to identifying chronological order. Assume that source A
has data with timestamp 10:22:01 (ignoring the date portion for simplicity)
and source B has data with the same timestamp (you said this was possible).
They both get inserted into the DB (regardless of table) at roughly the same
time. Based on data alone, it is not possible to determine which is first.
What does the DB have to assist? You could use a timestamp for the table
(or tables) and rely on the current implementation to derive order. Somone
recently posted that this type of solution was working. You could use a
datetime column to mark each row with the datetime of insert. This approach
may suffer due to accuracy limitations. Lastly, you could use a single
table with an identity. This should logically work (and it is what you
considered) but may be problematic due to contention / locking. Hotspots
will definitely be something to avoid. Note that this approach is also
predicated on the immediate insertion of data, implying that every data
"event" is recorded on a one-by-one basis. This is not the best approach in
terms of network usage (and does not scale well).
You might want to investigate any potential client-side approaches as well.
In a single-threaded application, you could also impose your own serial
numbering on the data before insertion.
Lastly, you might want to indicate what your ultimate goal is. Relational
databases are often slower than file-based approaches for data of this type
(there is a lot of overhead that you may not need for recording data).
There are other ways to record "data" in fifo order that might later to
imported into a real database for analysis or reporting. If there are no
real relationships in your data, then perhaps you don't need a rdbms.
Without knowing what you intend to do with the information, this is the
approach I would investigate first.
"Paul" <paulsmith5@.hotmail.com> wrote in message
news:ca236fb1.0409220850.2058abc9@.posting.google.com...
> Hi,
> Apologies for the vague subject title but I couldn't think of a snappy
> title to describe what I am doing. Please bear with me as I explain...
> I am developing an application that records data from several
> different sources operating at relatively high rates (e.g. >5 times
> per sec). The data from the various sources is diverse, in terms of
> the type (e.g. data from source A may comprise of 5 parameters in
> binary format, whereas data from source B may comprise of a single
> parameter in ascii format). The one thing they have in common is a
> timestamp. The timestamp although unique for each source is not unique
> amongst all sources. For example although there will be only ever be a
> single row in TableA with a particular timestamp representing a
> particular point in time, there may also be a row in TableB with the
> same timestamp. In an attempt to have the writes as efficient as
> possible and to prevent the tables growing too big I record data from
> each source to different tables. These tables are created at runtime
> before I begin to record the data. Following a period of recording
> some tables may still have in excess of 500,000 rows. I now wish to
> reconstruct the sequence of events in chronological order, i.e.
> retrieve the earliest data recorded from all the sources, followed by
> the next and so on. I'm wondering how to approach this. I was thinking
> of creating a single large table into which I would insert the
> timestamp and ID of every row from each source as well as the table
> name, then ceating a clustered index on the timestamp field, so that
> the rows would be physically sorted in chronlogical order. This table
> may look like the following when complete
> 1 09/22/2004 16:00:00 Table1
> 2 09/22/2004 16:00:01 Table1
> 1 09/22/2004 16:00:02 Table2
> 3 09/22/2004 16:00:03 Table1
> 2 09/22/2004 16:00:03 Table2
> etc.
> This would only ever have to happen once (I appreciate it may take
> some time to construct) - then each time I need to I could simply move
> through the table row by row retrieving the ID and table name which I
> could then use as parameters in a query that would retrieve the data.
> I'd appreciate any feedback on this approach i.e. is it madness! I was
> wondering about views etc. but I don't know if these could help at all
> Thanks,
> Paul|||Hi Scott,
Thanks for the reply. Apologies for any confusion arising from my use
of the term timestamp - it does not refer to the sql data type in this
case, rather to the datetime at which the source produced the data -
which is provided by the source itself. Therefore the limitations in
terms of sql server accuracy you refer to do not apply in this case.
Limitations in terms of accuracy emanate from the sources themselves
and I don't wish to concern myself with them at this point as much as
to say that action has been taken to ensure the are synchronised as
much as possible. I did take the precaution (whether right or wrong)
of recording however the millisecond portion of the 'timestamp' to a
separate field to maintain its integrity. I note your point on the
clustered index/ordering thanks. I think that the order in which the
data was produced by the source can be determined from the data itself
- notably the timestamp. The physical insertion order does not concern
me but perhaps if needed I could use an identity field - but if a
single thread is involved as is the case this is a mute point (I
think). Regarding identifying chronological order you are correct when
you ask where can the database assist in differentiating between two
records with the same timestamp/datetimes but I'm thinking it doesn't
have to. I'm only interested in it differentiating between records
with different datetimes i.e. a record with a datetime preceding
another one will be higher up in the order. When I replay the data in
sequence and I encounter two records with the same datetime the delay
in displaying them will be down to how quickly they can be processed
and when you are talking about portions of a second the result will
often be invisible to the naked eye - not perfect but not critical
either. As far as using a non-relational system that's out of the
question at this point.
Again thanks for the reply, I found it useful and appreciate your
time. Perhaps I will refrain from asking such general questions in
future - I think they can raise more questions then they answer - a
problem hard to address through this medium.
Paul
Assume that source A
> has data with timestamp 10:22:01 (ignoring the date portion for simplicity)
> and source B has data with the same timestamp (you said this was possible).
> They both get inserted into the DB (regardless of table) at roughly the same
> time. Based on data alone, it is not possible to determine which is first.
> What does the DB have to assist?
"Scott Morris" <bogus@.bogus.com> wrote in message news:<uNMwmBNoEHA.3324@.TK2MSFTNGP12.phx.gbl>...
> First, timestamp has no relationship to date or time; the synonym is
> rowversion. It is unique for a db and appears to be currently be
> implemented in such a way as to be predictable. Based on the current
> implementation, one can make comparisons (involving greater than and less
> than) that give correct results. However, I don't think the current
> implementation can be safely relied on and, IIRC, MS only supports equal /
> not equal comparisons. Note - since you are posting in a db-related NG, I
> assume that your reference to timestamp is actually a reference to the sql
> server datatype. If this assumption is incorrect - well that's your fault
> for using an ambiguous term outside of the implied context.
> Second, datetime values are accurate to 3 milliseconds. If your term
> "timestamp" refers to a datatype of this nature, is this accuracy
> sufficient? Note that there are two aspects to accuracy. First is the
> relative difference in accuracy between your datasource timestamp values and
> this datatype. The second aspect is relative to a difference in accuracy.
> Your timestamp data might be more accurate (e.g., 1 ms) but occurs at
> intervals that are significantly less accurate (e.g., > 3 ms). Of course,
> you could use a different datatype, losing the ability to use the builtin
> datetime functions and creating potential ordering problems.
> Third, a table is, be definition, unordered. Your narrative assumes and
> implies otherwise. Clustering does affect physical ordering. However,
> there is nothing that can guarantee you can access rows in physical
> insertion order unless that order can be determined by the data itself. In
> other words, you must provide a way to include the appropriate information
> in an order by clause. Even with a heap and a cursor, I'm not certain that
> there is any way to guarantee the desired order.
> Given this information, how do you intend to store the data (regardless of
> number of tables) in such a way that one can reconstruct the events in
> chrono. order? There are two implications in your narrative that might be
> problematic. First, is that "reconstruction" implies a single thread of
> execution. Is this valid? If not, then you might also need to record
> additional information to differentiate the multiple streams of data (which
> just might be adding data at the same instant in time). The other potential
> pitfall is that reconstruction often implies "re-running" (e.g.,
> reconstruction of the data in the same order/timeframe). Fear this!
> So now we're down to identifying chronological order. Assume that source A
> has data with timestamp 10:22:01 (ignoring the date portion for simplicity)
> and source B has data with the same timestamp (you said this was possible).
> They both get inserted into the DB (regardless of table) at roughly the same
> time. Based on data alone, it is not possible to determine which is first.
> What does the DB have to assist? You could use a timestamp for the table
> (or tables) and rely on the current implementation to derive order. Somone
> recently posted that this type of solution was working. You could use a
> datetime column to mark each row with the datetime of insert. This approach
> may suffer due to accuracy limitations. Lastly, you could use a single
> table with an identity. This should logically work (and it is what you
> considered) but may be problematic due to contention / locking. Hotspots
> will definitely be something to avoid. Note that this approach is also
> predicated on the immediate insertion of data, implying that every data
> "event" is recorded on a one-by-one basis. This is not the best approach in
> terms of network usage (and does not scale well).
> You might want to investigate any potential client-side approaches as well.
> In a single-threaded application, you could also impose your own serial
> numbering on the data before insertion.
> Lastly, you might want to indicate what your ultimate goal is. Relational
> databases are often slower than file-based approaches for data of this type
> (there is a lot of overhead that you may not need for recording data).
> There are other ways to record "data" in fifo order that might later to
> imported into a real database for analysis or reporting. If there are no
> real relationships in your data, then perhaps you don't need a rdbms.
> Without knowing what you intend to do with the information, this is the
> approach I would investigate first.
> "Paul" <paulsmith5@.hotmail.com> wrote in message
> news:ca236fb1.0409220850.2058abc9@.posting.google.com...
> > Hi,
> >
> > Apologies for the vague subject title but I couldn't think of a snappy
> > title to describe what I am doing. Please bear with me as I explain...
> >
> > I am developing an application that records data from several
> > different sources operating at relatively high rates (e.g. >5 times
> > per sec). The data from the various sources is diverse, in terms of
> > the type (e.g. data from source A may comprise of 5 parameters in
> > binary format, whereas data from source B may comprise of a single
> > parameter in ascii format). The one thing they have in common is a
> > timestamp. The timestamp although unique for each source is not unique
> > amongst all sources. For example although there will be only ever be a
> > single row in TableA with a particular timestamp representing a
> > particular point in time, there may also be a row in TableB with the
> > same timestamp. In an attempt to have the writes as efficient as
> > possible and to prevent the tables growing too big I record data from
> > each source to different tables. These tables are created at runtime
> > before I begin to record the data. Following a period of recording
> > some tables may still have in excess of 500,000 rows. I now wish to
> > reconstruct the sequence of events in chronological order, i.e.
> > retrieve the earliest data recorded from all the sources, followed by
> > the next and so on. I'm wondering how to approach this. I was thinking
> > of creating a single large table into which I would insert the
> > timestamp and ID of every row from each source as well as the table
> > name, then ceating a clustered index on the timestamp field, so that
> > the rows would be physically sorted in chronlogical order. This table
> > may look like the following when complete
> >
> > 1 09/22/2004 16:00:00 Table1
> > 2 09/22/2004 16:00:01 Table1
> > 1 09/22/2004 16:00:02 Table2
> > 3 09/22/2004 16:00:03 Table1
> > 2 09/22/2004 16:00:03 Table2
> > etc.
> >
> > This would only ever have to happen once (I appreciate it may take
> > some time to construct) - then each time I need to I could simply move
> > through the table row by row retrieving the ID and table name which I
> > could then use as parameters in a query that would retrieve the data.
> > I'd appreciate any feedback on this approach i.e. is it madness! I was
> > wondering about views etc. but I don't know if these could help at all
> >
> > Thanks,
> >
> > Paul|||It's very difficult to answer general questions in a NG. Context usually
has direct bearing on the issues and the context is difficult to communicate
(as well as lengthy). From the short discussion, it looks like you need to
conduct a thorough analysis of the data (a good bit you've already done).
The characteristics of the data will lead (or force!) you to the appropriate
design. Just be aware of the datatype issues (some of which I mentioned)
and their potential effects on your system. One last comment - you might
want to conduct some testing involving a typically loaded machine / network
to verify that it can handle the expected volume of inserts. Better to know
this ahead of time (where you can design around it) then discover it
afterwards.
Good luck.
"Paul" <paulsmith5@.hotmail.com> wrote in message
news:ca236fb1.0409230256.6a4876b0@.posting.google.com...
> Hi Scott,
> Thanks for the reply. Apologies for any confusion arising from my use
> of the term timestamp - it does not refer to the sql data type in this
> case, rather to the datetime at which the source produced the data -
> which is provided by the source itself. Therefore the limitations in
> terms of sql server accuracy you refer to do not apply in this case.
> Limitations in terms of accuracy emanate from the sources themselves
> and I don't wish to concern myself with them at this point as much as
> to say that action has been taken to ensure the are synchronised as
> much as possible. I did take the precaution (whether right or wrong)
> of recording however the millisecond portion of the 'timestamp' to a
> separate field to maintain its integrity. I note your point on the
> clustered index/ordering thanks. I think that the order in which the
> data was produced by the source can be determined from the data itself
> - notably the timestamp. The physical insertion order does not concern
> me but perhaps if needed I could use an identity field - but if a
> single thread is involved as is the case this is a mute point (I
> think). Regarding identifying chronological order you are correct when
> you ask where can the database assist in differentiating between two
> records with the same timestamp/datetimes but I'm thinking it doesn't
> have to. I'm only interested in it differentiating between records
> with different datetimes i.e. a record with a datetime preceding
> another one will be higher up in the order. When I replay the data in
> sequence and I encounter two records with the same datetime the delay
> in displaying them will be down to how quickly they can be processed
> and when you are talking about portions of a second the result will
> often be invisible to the naked eye - not perfect but not critical
> either. As far as using a non-relational system that's out of the
> question at this point.
> Again thanks for the reply, I found it useful and appreciate your
> time. Perhaps I will refrain from asking such general questions in
> future - I think they can raise more questions then they answer - a
> problem hard to address through this medium.
> Paul
>
> Assume that source A
> > has data with timestamp 10:22:01 (ignoring the date portion for
simplicity)
> > and source B has data with the same timestamp (you said this was
possible).
> > They both get inserted into the DB (regardless of table) at roughly the
same
> > time. Based on data alone, it is not possible to determine which is
first.
> > What does the DB have to assist?
> "Scott Morris" <bogus@.bogus.com> wrote in message
news:<uNMwmBNoEHA.3324@.TK2MSFTNGP12.phx.gbl>...
> > First, timestamp has no relationship to date or time; the synonym is
> > rowversion. It is unique for a db and appears to be currently be
> > implemented in such a way as to be predictable. Based on the current
> > implementation, one can make comparisons (involving greater than and
less
> > than) that give correct results. However, I don't think the current
> > implementation can be safely relied on and, IIRC, MS only supports equal
/
> > not equal comparisons. Note - since you are posting in a db-related NG,
I
> > assume that your reference to timestamp is actually a reference to the
sql
> > server datatype. If this assumption is incorrect - well that's your
fault
> > for using an ambiguous term outside of the implied context.
> >
> > Second, datetime values are accurate to 3 milliseconds. If your term
> > "timestamp" refers to a datatype of this nature, is this accuracy
> > sufficient? Note that there are two aspects to accuracy. First is the
> > relative difference in accuracy between your datasource timestamp values
and
> > this datatype. The second aspect is relative to a difference in
accuracy.
> > Your timestamp data might be more accurate (e.g., 1 ms) but occurs at
> > intervals that are significantly less accurate (e.g., > 3 ms). Of
course,
> > you could use a different datatype, losing the ability to use the
builtin
> > datetime functions and creating potential ordering problems.
> >
> > Third, a table is, be definition, unordered. Your narrative assumes and
> > implies otherwise. Clustering does affect physical ordering. However,
> > there is nothing that can guarantee you can access rows in physical
> > insertion order unless that order can be determined by the data itself.
In
> > other words, you must provide a way to include the appropriate
information
> > in an order by clause. Even with a heap and a cursor, I'm not certain
that
> > there is any way to guarantee the desired order.
> >
> > Given this information, how do you intend to store the data (regardless
of
> > number of tables) in such a way that one can reconstruct the events in
> > chrono. order? There are two implications in your narrative that might
be
> > problematic. First, is that "reconstruction" implies a single thread of
> > execution. Is this valid? If not, then you might also need to record
> > additional information to differentiate the multiple streams of data
(which
> > just might be adding data at the same instant in time). The other
potential
> > pitfall is that reconstruction often implies "re-running" (e.g.,
> > reconstruction of the data in the same order/timeframe). Fear this!
> >
> > So now we're down to identifying chronological order. Assume that
source A
> > has data with timestamp 10:22:01 (ignoring the date portion for
simplicity)
> > and source B has data with the same timestamp (you said this was
possible).
> > They both get inserted into the DB (regardless of table) at roughly the
same
> > time. Based on data alone, it is not possible to determine which is
first.
> > What does the DB have to assist? You could use a timestamp for the
table
> > (or tables) and rely on the current implementation to derive order.
Somone
> > recently posted that this type of solution was working. You could use a
> > datetime column to mark each row with the datetime of insert. This
approach
> > may suffer due to accuracy limitations. Lastly, you could use a single
> > table with an identity. This should logically work (and it is what you
> > considered) but may be problematic due to contention / locking.
Hotspots
> > will definitely be something to avoid. Note that this approach is also
> > predicated on the immediate insertion of data, implying that every data
> > "event" is recorded on a one-by-one basis. This is not the best
approach in
> > terms of network usage (and does not scale well).
> >
> > You might want to investigate any potential client-side approaches as
well.
> > In a single-threaded application, you could also impose your own serial
> > numbering on the data before insertion.
> >
> > Lastly, you might want to indicate what your ultimate goal is.
Relational
> > databases are often slower than file-based approaches for data of this
type
> > (there is a lot of overhead that you may not need for recording data).
> > There are other ways to record "data" in fifo order that might later to
> > imported into a real database for analysis or reporting. If there are
no
> > real relationships in your data, then perhaps you don't need a rdbms.
> > Without knowing what you intend to do with the information, this is the
> > approach I would investigate first.
> >
> > "Paul" <paulsmith5@.hotmail.com> wrote in message
> > news:ca236fb1.0409220850.2058abc9@.posting.google.com...
> > > Hi,
> > >
> > > Apologies for the vague subject title but I couldn't think of a snappy
> > > title to describe what I am doing. Please bear with me as I explain...
> > >
> > > I am developing an application that records data from several
> > > different sources operating at relatively high rates (e.g. >5 times
> > > per sec). The data from the various sources is diverse, in terms of
> > > the type (e.g. data from source A may comprise of 5 parameters in
> > > binary format, whereas data from source B may comprise of a single
> > > parameter in ascii format). The one thing they have in common is a
> > > timestamp. The timestamp although unique for each source is not unique
> > > amongst all sources. For example although there will be only ever be a
> > > single row in TableA with a particular timestamp representing a
> > > particular point in time, there may also be a row in TableB with the
> > > same timestamp. In an attempt to have the writes as efficient as
> > > possible and to prevent the tables growing too big I record data from
> > > each source to different tables. These tables are created at runtime
> > > before I begin to record the data. Following a period of recording
> > > some tables may still have in excess of 500,000 rows. I now wish to
> > > reconstruct the sequence of events in chronological order, i.e.
> > > retrieve the earliest data recorded from all the sources, followed by
> > > the next and so on. I'm wondering how to approach this. I was thinking
> > > of creating a single large table into which I would insert the
> > > timestamp and ID of every row from each source as well as the table
> > > name, then ceating a clustered index on the timestamp field, so that
> > > the rows would be physically sorted in chronlogical order. This table
> > > may look like the following when complete
> > >
> > > 1 09/22/2004 16:00:00 Table1
> > > 2 09/22/2004 16:00:01 Table1
> > > 1 09/22/2004 16:00:02 Table2
> > > 3 09/22/2004 16:00:03 Table1
> > > 2 09/22/2004 16:00:03 Table2
> > > etc.
> > >
> > > This would only ever have to happen once (I appreciate it may take
> > > some time to construct) - then each time I need to I could simply move
> > > through the table row by row retrieving the ID and table name which I
> > > could then use as parameters in a query that would retrieve the data.
> > > I'd appreciate any feedback on this approach i.e. is it madness! I was
> > > wondering about views etc. but I don't know if these could help at all
> > >
> > > Thanks,
> > >
> > > Paul
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment